date()
## [1] "Mon Dec 4 12:42:36 2023"
Heads-up: I am new to R. During this first week, I
studied the first four chapters of the course book in RStudio and watched
the given tutorial videos.
I have learned how to:
- set my directory
- use R Projects, R Markdown files and GitHub commits
- install and use libraries
- identify and use the different data types including the factors
- use the pipe %>% (it means “and then…”)
- open a data set, read it and inspect its structure
- select rows and columns, match text and rename columns
- filter data based on variable values
- create a table / tibble
- create an object that is reusable later
- use date and time formats
- add columns with text, int, calculations, categories etc.
- group data by one or more variables and then perform
calculations
- combine all data types to create a string with the function
paste()
- check for NAs and remove them before doing calculations
- summarize to sum up data per variable value
- manipulate table from wide to long and long to wide
- combine 2 tables (functions: bind, inner_join, full_join, left_join,
right_join)
- use the most common arithmetic functions and percentages
- rank and reorder values
- create plots (bar, col, line, point, box, histogram)
- play with the different style and aesthetic possibilities
- add labels to plots
- combine plots together
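As a small recap, here is a toy example (the data and names are invented for illustration) that chains several of these verbs together:

```r
library(tidyverse)

# Invented toy data: patients with one missing age, plus a second table to join
patients <- tibble(id = 1:4,
                   age = c(34, NA, 52, 41),
                   sex = factor(c("F", "M", "F", "M")))
visits <- tibble(id = c(1, 2, 4),
                 clinic = c("North", "South", "North"))

patients %>%
  filter(!is.na(age)) %>%                         # check and remove the NA
  left_join(visits, by = "id") %>%                # combine two tables
  mutate(label = paste(sex, age, sep = "-")) %>%  # build a string with paste()
  group_by(sex) %>%                               # group, and then...
  summarise(mean_age = mean(age))                 # ...summarise per group
```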
With this tool I discovered how easy and fast it can be to manipulate
and visualize data from a big data set.
Conclusion of Week 1:
The book is really well made. The style is easy to follow and
understand. We can learn at our own pace and try the examples in
RStudio to assimilate the concepts well.
I am new to R, and I find it hard to learn so many different
concepts in one week. I would have liked more exercises, without
the solutions written directly underneath, and with a tool to check our
lines of code. Finally, I understand it is not necessary to know
every function by heart, but I would like to understand the
capabilities of this tool well by practicing and doing more real-life
exercises.
Lines of code to summarize learning in week 1:
# Get the data set
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(scales)
##
## Attaching package: 'scales'
##
## The following object is masked from 'package:purrr':
##
## discard
##
## The following object is masked from 'package:readr':
##
## col_factor
gbd_full <- read_csv("https://raw.githubusercontent.com/KimmoVehkalahti/RHDS/master/data/global_burden_disease_cause-year-sex-income.csv")
## Rows: 168 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): cause, sex, income
## dbl (2): year, deaths_millions
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Code I wrote to compute the percentage of deaths per cause for each year.
exercise4 <- gbd_full %>%
group_by(year, cause) %>%
summarise(total_per_cause= sum(deaths_millions)) %>%
group_by(year) %>%
mutate(total_per_year =sum(total_per_cause)) %>%
mutate(percentage = percent(total_per_cause/total_per_year)) %>%
select(year,cause,percentage) %>%
pivot_wider(names_from = cause, values_from = percentage)
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
exercise4
## # A tibble: 7 × 4
## # Groups: year [7]
## year `Communicable diseases` Injuries `Non-communicable diseases`
## <dbl> <chr> <chr> <chr>
## 1 1990 33% 9% 58%
## 2 1995 31% 9% 60%
## 3 2000 29% 9% 62%
## 4 2005 27% 9% 64%
## 5 2010 24% 9% 67%
## 6 2015 20% 8% 72%
## 7 2017 19% 8% 73%
# Code I wrote to compute the number of deaths per sex and income, and the male/female ratio for each income group.
gbd_full %>%
filter(year ==1990) %>%
group_by(sex,income) %>%
summarize(deaths_per_sex_income = sum(deaths_millions)) %>%
pivot_wider(names_from = sex, values_from = deaths_per_sex_income) %>%
mutate(diff_M_F = Male/Female)
## `summarise()` has grouped output by 'sex'. You can override using the `.groups`
## argument.
## # A tibble: 4 × 4
## income Female Male diff_M_F
## <chr> <dbl> <dbl> <dbl>
## 1 High 4.14 4.46 1.08
## 2 Low 2.22 2.57 1.16
## 3 Lower-Middle 8.47 9.83 1.16
## 4 Upper-Middle 6.68 7.95 1.19
date()
## [1] "Mon Dec 4 12:42:39 2023"
# set directory
setwd("/Users/margot/Desktop/Desktop - MacBook Pro de MARGOT/Open data with R 2023/IODS-project")
Thoughts about week 2:
After reading chapters 1-7, I am now more confident using RStudio. I also understand the language better, and I can search the web to find new functions that I did not know.
It is very exciting to see how efficient this tool is and to think about all the analyses we can do. I am an Open University student and I can already see how to use this tool at work :).
Exercise: WRANGLING. Please find the script file on GitHub: create_learning2014_week2
Exercise: DATA ANALYSIS
# Using the table from the course.
csv_table_read <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/learning2014.txt", sep = ",", header = T)
# library
library("tidyverse")
library("finalfit")
library("broom")
csv_table_read
## gender age attitude deep stra surf points
## 1 F 53 3.7 3.583333 3.375 2.583333 25
## 2 M 55 3.1 2.916667 2.750 3.166667 12
## 3 F 49 2.5 3.500000 3.625 2.250000 24
## 4 M 53 3.5 3.500000 3.125 2.250000 10
## 5 M 49 3.7 3.666667 3.625 2.833333 22
## 6 F 38 3.8 4.750000 3.625 2.416667 21
## 7 M 50 3.5 3.833333 2.250 1.916667 21
## 8 F 37 2.9 3.250000 4.000 2.833333 31
## 9 M 37 3.8 4.333333 4.250 2.166667 24
## 10 F 42 2.1 4.000000 3.500 3.000000 26
## 11 M 37 3.9 3.583333 3.625 2.666667 31
## 12 F 34 3.8 3.833333 4.750 2.416667 31
## 13 F 34 2.4 4.250000 3.625 2.250000 23
## 14 F 34 3.0 3.333333 3.500 2.750000 25
## 15 M 35 2.6 4.166667 1.750 2.333333 21
## 16 F 33 4.1 3.666667 3.875 2.333333 31
## 17 F 32 2.6 4.083333 1.375 2.916667 20
## 18 F 44 2.6 3.500000 3.250 2.500000 22
## 19 M 29 1.7 4.083333 3.000 3.750000 9
## 20 F 30 2.7 4.000000 3.750 2.750000 24
## 21 M 27 3.9 3.916667 2.625 2.333333 28
## 22 M 29 3.4 4.000000 2.375 2.416667 30
## 23 F 31 2.7 4.000000 3.625 3.000000 24
## 24 F 37 2.3 3.666667 2.750 2.416667 9
## 25 F 26 3.7 3.666667 1.750 2.833333 26
## 26 F 26 4.4 4.416667 3.250 3.166667 32
## 27 M 30 4.1 3.916667 4.000 3.000000 32
## 28 F 33 3.7 3.750000 3.625 2.000000 33
## 29 F 33 2.5 3.250000 2.875 3.500000 29
## 30 M 28 3.0 3.583333 3.000 3.750000 30
## 31 M 26 3.4 4.916667 1.625 2.500000 19
## 32 F 27 3.2 3.583333 3.250 2.083333 23
## 33 F 25 2.0 2.916667 3.500 2.416667 19
## 34 F 31 2.4 3.666667 3.000 2.583333 12
## 35 M 20 4.2 4.500000 3.250 1.583333 10
## 36 F 39 1.6 4.083333 1.875 2.833333 11
## 37 M 38 3.1 3.833333 4.375 1.833333 20
## 38 M 24 3.8 3.250000 3.625 2.416667 26
## 39 M 26 3.8 2.333333 2.500 3.250000 31
## 40 M 25 3.3 3.333333 1.250 3.416667 20
## 41 F 30 1.7 4.083333 4.000 3.416667 23
## 42 F 25 2.5 2.916667 3.000 3.166667 12
## 43 M 30 3.2 3.333333 2.500 3.500000 24
## 44 F 48 3.5 3.833333 4.875 2.666667 17
## 45 F 24 3.2 3.666667 5.000 2.416667 29
## 46 F 40 4.2 4.666667 4.375 3.583333 23
## 47 M 25 3.1 3.750000 3.250 2.083333 28
## 48 F 23 3.9 3.416667 4.000 3.750000 31
## 49 F 25 1.9 4.166667 3.125 2.916667 23
## 50 F 23 2.1 2.916667 2.500 2.916667 25
## 51 M 27 2.5 4.166667 3.125 2.416667 18
## 52 M 25 3.2 3.583333 3.250 3.000000 19
## 53 M 23 3.2 2.833333 2.125 3.416667 22
## 54 F 23 2.6 4.000000 2.750 2.916667 25
## 55 F 23 2.3 2.916667 2.375 3.250000 21
## 56 F 45 3.8 3.000000 3.125 3.250000 9
## 57 F 22 2.8 4.083333 4.000 2.333333 28
## 58 F 23 3.3 2.916667 4.000 3.250000 25
## 59 M 21 4.8 3.500000 2.250 2.500000 29
## 60 M 21 4.0 4.333333 3.250 1.750000 33
## 61 F 21 4.0 4.250000 3.625 2.250000 33
## 62 F 21 4.7 3.416667 3.625 2.083333 25
## 63 F 26 2.3 3.083333 2.500 2.833333 18
## 64 F 25 3.1 4.583333 1.875 2.833333 22
## 65 F 26 2.7 3.416667 2.000 2.416667 17
## 66 M 21 4.1 3.416667 1.875 2.250000 25
## 67 F 23 3.4 3.416667 4.000 2.833333 28
## 68 F 22 2.5 3.583333 2.875 2.250000 22
## 69 F 22 2.1 1.583333 3.875 1.833333 26
## 70 F 22 1.4 3.333333 2.500 2.916667 11
## 71 F 23 1.9 4.333333 2.750 2.916667 29
## 72 M 22 3.7 4.416667 4.500 2.083333 22
## 73 M 23 3.2 4.833333 3.375 2.333333 21
## 74 M 24 2.8 3.083333 2.625 2.416667 28
## 75 F 22 4.1 3.000000 4.125 2.750000 33
## 76 F 23 2.5 4.083333 2.625 3.250000 16
## 77 M 22 2.8 4.083333 2.250 1.750000 31
## 78 M 20 3.8 3.750000 2.750 2.583333 22
## 79 M 22 3.1 3.083333 3.000 3.333333 31
## 80 M 21 3.5 4.750000 1.625 2.833333 23
## 81 F 22 3.6 4.250000 1.875 2.500000 26
## 82 F 23 2.6 4.166667 3.375 2.416667 12
## 83 M 21 4.4 4.416667 3.750 2.416667 26
## 84 M 22 4.5 3.833333 2.125 2.583333 31
## 85 M 29 3.2 3.333333 2.375 3.000000 19
## 86 F 29 3.9 3.166667 2.750 2.000000 30
## 87 F 21 2.5 3.166667 3.125 3.416667 12
## 88 M 28 3.3 3.833333 3.500 2.833333 17
## 89 F 21 3.3 4.250000 2.625 2.250000 18
## 90 F 30 3.0 3.833333 3.375 2.750000 19
## 91 F 21 2.9 3.666667 2.250 3.916667 21
## 92 M 23 3.3 3.833333 3.000 2.333333 24
## 93 F 21 3.3 3.833333 4.000 2.750000 28
## 94 F 21 3.5 3.833333 3.500 2.750000 17
## 95 F 20 3.6 3.666667 2.625 2.916667 18
## 96 M 22 3.7 4.333333 2.500 2.083333 17
## 97 M 21 4.2 3.750000 3.750 3.666667 23
## 98 M 21 3.2 4.166667 3.625 2.833333 26
## 99 F 20 5.0 4.000000 4.125 3.416667 28
## 100 M 22 4.7 4.000000 4.375 1.583333 31
## 101 F 20 3.6 4.583333 2.625 2.916667 27
## 102 F 20 3.6 3.666667 4.000 3.000000 25
## 103 M 24 2.9 3.666667 2.750 2.916667 23
## 104 F 20 3.5 3.833333 2.750 2.666667 21
## 105 F 19 4.0 2.583333 1.375 3.000000 27
## 106 F 21 3.5 3.500000 2.250 2.750000 28
## 107 F 21 3.2 3.083333 3.625 3.083333 23
## 108 F 22 2.6 4.250000 3.750 2.500000 21
## 109 F 25 2.0 3.166667 4.000 2.333333 25
## 110 F 21 2.7 3.083333 3.125 3.000000 11
## 111 F 22 3.2 4.166667 3.250 3.000000 19
## 112 F 25 3.3 2.250000 2.125 4.000000 24
## 113 F 20 3.9 3.333333 2.875 3.250000 28
## 114 M 24 3.3 3.083333 1.500 3.500000 21
## 115 F 20 3.0 2.750000 2.500 3.500000 24
## 116 M 21 3.7 3.250000 3.250 3.833333 24
## 117 F 20 2.5 4.000000 3.625 2.916667 20
## 118 F 20 2.9 3.583333 3.875 2.166667 19
## 119 M 31 3.9 4.083333 3.875 1.666667 30
## 120 F 20 3.6 4.250000 2.375 2.083333 22
## 121 F 22 2.9 3.416667 3.000 2.833333 16
## 122 F 22 2.1 3.083333 3.375 3.416667 16
## 123 M 21 3.1 3.500000 2.750 3.333333 19
## 124 M 22 4.0 3.666667 4.500 2.583333 30
## 125 F 21 3.1 4.250000 2.625 2.833333 23
## 126 F 21 2.3 4.250000 2.750 3.333333 19
## 127 F 21 2.8 3.833333 3.250 3.000000 18
## 128 F 21 3.7 4.416667 4.125 2.583333 28
## 129 F 20 2.6 3.500000 3.375 2.416667 21
## 130 F 21 2.4 3.583333 2.750 3.583333 19
## 131 F 25 3.0 3.666667 4.125 2.083333 27
## 132 M 21 2.8 2.083333 3.250 4.333333 24
## 133 F 24 2.9 4.250000 2.875 2.666667 21
## 134 F 20 2.4 3.583333 2.875 3.000000 20
## 135 M 21 3.1 4.000000 2.375 2.666667 28
## 136 F 20 1.9 3.333333 3.875 2.166667 12
## 137 F 20 2.0 3.500000 2.125 2.666667 21
## 138 F 18 3.8 3.166667 4.000 2.250000 28
## 139 F 21 3.4 3.583333 3.250 2.666667 31
## 140 F 19 3.7 3.416667 2.625 3.333333 18
## 141 F 21 2.9 4.250000 2.750 3.500000 25
## 142 F 20 2.3 3.250000 4.000 2.750000 19
## 143 M 21 4.1 4.416667 3.000 2.000000 21
## 144 F 20 2.7 3.250000 3.375 2.833333 16
## 145 F 21 3.5 3.916667 3.875 3.500000 7
## 146 F 20 3.4 3.583333 3.250 2.500000 21
## 147 F 18 3.2 4.500000 3.375 3.166667 17
## 148 M 22 3.3 3.583333 4.125 3.083333 22
## 149 F 22 3.3 3.666667 3.500 2.916667 18
## 150 M 24 3.5 2.583333 2.000 3.166667 25
## 151 F 19 3.2 4.166667 3.625 2.500000 24
## 152 F 20 3.1 3.250000 3.375 3.833333 23
## 153 F 20 2.8 4.333333 2.125 2.250000 23
## 154 F 17 1.7 3.916667 4.625 3.416667 26
## 155 M 19 1.9 2.666667 2.500 3.750000 12
## 156 F 20 3.5 3.083333 2.875 3.000000 32
## 157 F 20 2.4 3.750000 2.750 2.583333 22
## 158 F 20 2.1 4.166667 4.000 3.333333 20
## 159 F 20 2.9 4.166667 2.375 2.833333 21
## 160 F 19 1.9 3.250000 3.875 3.000000 23
## 161 F 19 2.0 4.083333 3.375 2.833333 20
## 162 F 22 4.2 2.916667 1.750 3.166667 28
## 163 M 35 4.1 3.833333 3.000 2.750000 31
## 164 F 18 3.7 3.166667 2.625 3.416667 18
## 165 F 19 3.6 3.416667 2.625 3.000000 30
## 166 M 21 1.8 4.083333 3.375 2.666667 19
# analyze the structure of the dataset
str(csv_table_read)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : chr "F" "M" "F" "M" ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
# analyze the dimension of the dataset
dim(csv_table_read)
## [1] 166 7
# Missing data? No data is missing.
ff_glimpse(csv_table_read)
## $Continuous
## label var_type n missing_n missing_percent mean sd min
## age age <int> 166 0 0.0 25.5 7.8 17.0
## attitude attitude <dbl> 166 0 0.0 3.1 0.7 1.4
## deep deep <dbl> 166 0 0.0 3.7 0.6 1.6
## stra stra <dbl> 166 0 0.0 3.1 0.8 1.2
## surf surf <dbl> 166 0 0.0 2.8 0.5 1.6
## points points <int> 166 0 0.0 22.7 5.9 7.0
## quartile_25 median quartile_75 max
## age 21.0 22.0 27.0 55.0
## attitude 2.6 3.2 3.7 5.0
## deep 3.3 3.7 4.1 4.9
## stra 2.6 3.2 3.6 5.0
## surf 2.4 2.8 3.2 4.3
## points 19.0 23.0 27.8 33.0
##
## $Categorical
## label var_type n missing_n missing_percent levels_n levels
## gender gender <chr> 166 0 0.0 2 -
## levels_count levels_percent
## gender - -
# check missing data per variable
missing_glimpse(csv_table_read)
## label var_type n missing_n missing_percent
## gender gender <chr> 166 0 0.0
## age age <int> 166 0 0.0
## attitude attitude <dbl> 166 0 0.0
## deep deep <dbl> 166 0 0.0
## stra stra <dbl> 166 0 0.0
## surf surf <dbl> 166 0 0.0
## points points <int> 166 0 0.0
# Count per gender and percentage male / female
library("scales")
csv_table_read %>%
count(gender) %>%
mutate(total_percentage = n / nrow(csv_table_read)) %>%
mutate(total_percentage2 = percent(total_percentage))
## gender n total_percentage total_percentage2
## 1 F 110 0.6626506 66%
## 2 M 56 0.3373494 34%
# Summary statistics (min, quartiles, median, mean, max) for each variable
summary(csv_table_read)
## gender age attitude deep
## Length:166 Min. :17.00 Min. :1.400 Min. :1.583
## Class :character 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.333
## Mode :character Median :22.00 Median :3.200 Median :3.667
## Mean :25.51 Mean :3.143 Mean :3.680
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.083
## Max. :55.00 Max. :5.000 Max. :4.917
## stra surf points
## Min. :1.250 Min. :1.583 Min. : 7.00
## 1st Qu.:2.625 1st Qu.:2.417 1st Qu.:19.00
## Median :3.188 Median :2.833 Median :23.00
## Mean :3.121 Mean :2.787 Mean :22.72
## 3rd Qu.:3.625 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :5.000 Max. :4.333 Max. :33.00
# Age varies from 17 to 55; the mean is 25.5 and the median 22. This suggests some relatively high values in the dataset.
# Attitude varies from 1.4 to 5.
# Points range from 7 to 33; the mean is 22.7 and the median 23. This suggests some relatively low values in the dataset.
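A quick sketch to check the skew suggested by these mean-median gaps, using the same data frame:

```r
# Positive gap -> right skew: a few older students pull the mean age up
mean(csv_table_read$age) - median(csv_table_read$age)
# Negative gap -> slight left skew in points
mean(csv_table_read$points) - median(csv_table_read$points)
```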
# We analyze the variables for both genders, females and males.
# draw a scatter plot matrix of the variables in learning2014.
# [-1] excludes the first column (gender)
pairs(csv_table_read[-1])
# access the GGally and ggplot2 libraries
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(ggplot2)
# create a more advanced plot matrix with ggpairs()
p <- ggpairs(csv_table_read, mapping = aes(), lower = list(combo = wrap("facethist", bins = 20)))
p
# Some panels suggest possible correlations between variables.
# Relationship between points and attitudes
csv_table_read %>%
ggplot(aes(x= attitude, y= points)) +
geom_point() +
facet_wrap(~ gender) +
geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'
# Female model
female_data <- csv_table_read %>%
filter(gender == "F")
View(female_data)
# Fit a multiple linear model for females. Let's check how points are influenced by age, attitude and deep learning approach
female_fitmodel <- lm(points ~ age + attitude + deep, data = female_data)
# In this model I want to check whether age, attitude and deep each affect points, controlling for the others.
# Summary: estimates, standard errors and p values
summary(female_fitmodel)
##
## Call:
## lm(formula = points ~ age + attitude + deep, data = female_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.058 -3.263 0.622 4.003 10.533
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.59029 4.28982 3.168 0.00201 **
## age -0.01743 0.06983 -0.250 0.80338
## attitude 3.40151 0.70837 4.802 5.19e-06 ***
## deep -0.27355 0.96270 -0.284 0.77685
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.353 on 106 degrees of freedom
## Multiple R-squared: 0.1791, Adjusted R-squared: 0.1559
## F-statistic: 7.71 on 3 and 106 DF, p-value: 0.0001043
summary(female_fitmodel)$r.squared
## [1] 0.1791183
# "age", "attitude" and "deep" together explain about 18% of the variation in "points".
# Intercept p value: significant as it is very small (0.002), so the intercept plays a clear role in the regression model.
# The model baseline is 13.59 (estimate) when no factors are taken into account.
# Age is not significant and is not correlated with points.
# Deep is not significant and is not correlated with points.
# Attitude is significant and seems to play a clear role on points:
# for a one-point increase in attitude, points increase by 3.40 (estimate).
# The fairly high std error shows the estimate is not very precise. It could be due to the sample size.
# I decide to drop the deep and age variables and keep only attitude.
female_fitmodel2 <- lm(points ~ attitude, data = female_data)
summary(female_fitmodel2)
##
## Call:
## lm(formula = points ~ attitude, data = female_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.0557 -3.3486 0.6137 3.9819 10.3668
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.194 2.156 5.655 1.29e-07 ***
## attitude 3.389 0.701 4.835 4.44e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.307 on 108 degrees of freedom
## Multiple R-squared: 0.1779, Adjusted R-squared: 0.1703
## F-statistic: 23.38 on 1 and 108 DF, p-value: 4.442e-06
tidy(female_fitmodel2)
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 12.2 2.16 5.66 0.000000129
## 2 attitude 3.39 0.701 4.83 0.00000444
summary(female_fitmodel2)$r.squared
## [1] 0.1779266
# The p values are very low and the std errors reasonable, so this model is sound and supports the positive relation: a more positive attitude -> more points.
# R-squared is still quite low.
# The model doesn't provide a good fit for the data, and a significant portion of the variance is not explained. It could be due to the sample size.
# autoplot: Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage
# Identify issues with my regression model, such as non-linearity, non-normality, or influential data points
# autoplot doesn't knit.
#autoplot(female_fitmodel)
#autoplot(female_fitmodel2)
# I use the plot() function with the `which` argument to get the desired plots.
plot(female_fitmodel,which = c(1,2,5))
plot(female_fitmodel2,which = c(1,2,5))
# We observe non-normality at both ends of the line in the QQ plot.
# Both models show some high-leverage points in the residuals vs leverage plot.
# male model
male_data <- csv_table_read %>%
filter(gender == "M")
View(male_data)
summary(male_data)
## gender age attitude deep
## Length:56 Min. :19.0 Min. :1.700 Min. :2.083
## Class :character 1st Qu.:21.0 1st Qu.:3.100 1st Qu.:3.396
## Mode :character Median :24.0 Median :3.400 Median :3.792
## Mean :26.8 Mean :3.443 Mean :3.725
## 3rd Qu.:29.0 3rd Qu.:3.900 3rd Qu.:4.083
## Max. :55.0 Max. :4.800 Max. :4.917
## stra surf points
## Min. :1.250 Min. :1.583 Min. : 9.00
## 1st Qu.:2.375 1st Qu.:2.312 1st Qu.:20.00
## Median :3.000 Median :2.625 Median :23.50
## Mean :2.964 Mean :2.704 Mean :23.48
## 3rd Qu.:3.531 3rd Qu.:3.167 3rd Qu.:28.25
## Max. :4.500 Max. :4.333 Max. :33.00
# Fit a multiple linear model for males. Let's check how points are influenced by age, attitude and deep learning approach
male_fitmodel <- lm(points ~ age + attitude + deep, data = male_data)
# Summary: estimates, standard errors and p values
summary(male_fitmodel)
##
## Call:
## lm(formula = points ~ age + attitude + deep, data = male_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.8084 -3.3162 -0.0696 3.2195 9.9927
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.21897 6.06857 3.002 0.004112 **
## age -0.16602 0.08456 -1.963 0.054974 .
## attitude 4.31829 1.11699 3.866 0.000309 ***
## deep -1.38378 1.22006 -1.134 0.261916
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.271 on 52 degrees of freedom
## Multiple R-squared: 0.2718, Adjusted R-squared: 0.2298
## F-statistic: 6.47 on 3 and 52 DF, p-value: 0.000835
tidy(male_fitmodel)
## # A tibble: 4 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 18.2 6.07 3.00 0.00411
## 2 age -0.166 0.0846 -1.96 0.0550
## 3 attitude 4.32 1.12 3.87 0.000309
## 4 deep -1.38 1.22 -1.13 0.262
summary(male_fitmodel)$r.squared
## [1] 0.2718164
# Similar results to the female model.
# Age and deep have smaller p values than in the female model (attitude's is slightly larger).
# R-squared is higher, explaining 27%, but it is still quite low. It could be due to the sample size.
# I decide to drop deep and age as variables and keep only attitude.
male_fitmodel2 <- lm(points ~ attitude, data = male_data)
summary(male_fitmodel2)
##
## Call:
## lm(formula = points ~ attitude, data = male_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.6535 -2.9073 -0.5121 3.6974 10.2106
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.061 3.953 2.292 0.02581 *
## attitude 4.189 1.129 3.711 0.00049 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.411 on 54 degrees of freedom
## Multiple R-squared: 0.2032, Adjusted R-squared: 0.1884
## F-statistic: 13.77 on 1 and 54 DF, p-value: 0.0004897
tidy(male_fitmodel2)
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 9.06 3.95 2.29 0.0258
## 2 attitude 4.19 1.13 3.71 0.000490
summary(male_fitmodel)$r.squared
## [1] 0.2718164
# The p values are very low, so this model is sound and supports the positive relation: a more positive attitude -> more points.
# R-squared of male_fitmodel2 is 0.2032 (the line above accidentally reprints male_fitmodel's R-squared), still quite low.
# The model doesn't provide a good fit for the data, and a significant portion of the variance is not explained. It could be due to the sample size.
# autoplot: Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage
# Identify issues with my regression model, such as non-linearity, non-normality, or influential data points
# autoplot doesn't knit!
# autoplot(male_fitmodel)
# autoplot(male_fitmodel2)
# plot with the plot function
plot(male_fitmodel,which = c(1,2,5))
plot(male_fitmodel2,which = c(1,2,5))
# The red line in residuals vs fitted stays quite close to the 0 line, which is good.
# Both models show non-normality, observed at the lower end of the QQ plot.
# Both models show some high-leverage points in the residuals vs leverage plot.
test_fit1 <- csv_table_read %>%
lm(points ~ deep, data = .)
library(ggfortify)
summary(test_fit1)
##
## Call:
## lm(formula = points ~ deep, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.6913 -3.6935 0.2862 4.9957 10.3537
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.1141 3.0908 7.478 4.31e-12 ***
## deep -0.1080 0.8306 -0.130 0.897
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.913 on 164 degrees of freedom
## Multiple R-squared: 0.000103, Adjusted R-squared: -0.005994
## F-statistic: 0.01689 on 1 and 164 DF, p-value: 0.8967
tidy(test_fit1) # the p value for deep is large (0.90), so deep is not significant
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 23.1 3.09 7.48 4.31e-12
## 2 deep -0.108 0.831 -0.130 8.97e- 1
summary(test_fit1)$r.squared # essentially zero
## [1] 0.0001029919
test_fit2 <- csv_table_read %>%
lm(points ~ deep * gender, data = .)
library(ggfortify)
summary(test_fit2)
##
## Call:
## lm(formula = points ~ deep * gender, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.3247 -3.3338 0.3369 4.6242 10.6787
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.36387 3.91828 5.708 5.33e-08 ***
## deep -0.01001 1.06032 -0.009 0.992
## genderM 2.67476 6.41719 0.417 0.677
## deep:genderM -0.40787 1.71487 -0.238 0.812
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.922 on 162 degrees of freedom
## Multiple R-squared: 0.00922, Adjusted R-squared: -0.009127
## F-statistic: 0.5025 on 3 and 162 DF, p-value: 0.6811
tidy(test_fit2) # no term is significant here (all p values are large)
## # A tibble: 4 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 22.4 3.92 5.71 0.0000000533
## 2 deep -0.0100 1.06 -0.00944 0.992
## 3 genderM 2.67 6.42 0.417 0.677
## 4 deep:genderM -0.408 1.71 -0.238 0.812
summary(test_fit2)$r.squared # too low
## [1] 0.009220341
# Female vs Male participants
csv_table_read %>%
ggplot(aes(x=gender)) +
geom_bar()
# age chart and gender per age
csv_table_read %>%
ggplot(aes(x= age, fill = gender)) +
geom_bar()
# age chart distribution per gender
csv_table_read %>%
ggplot(aes(x= age)) +
facet_grid(~gender) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# age box plot distribution per gender
csv_table_read %>%
ggplot(aes(x= gender, y=age)) +
geom_boxplot()
# relationship and distribution between age, points, and gender
csv_table_read %>%
ggplot(aes(y = points, x = age, colour = gender)) +
geom_point() +
labs(title = "Distribution of points per age and gender")
# With this data we can see the older participants who drive the mean age up (vs the median).
# Distribution of the points per gender - histogram
csv_table_read %>%
ggplot(aes(x = points)) +
geom_histogram() +
facet_grid(~gender) +
labs(title = "Histogram of points by Gender")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#Distribution of the points per gender - boxplot
csv_table_read %>%
ggplot(aes(y = points, x = gender, colour = gender)) +
geom_boxplot() +
labs(title = "Boxplot of points by Gender")
#QQ plot - points per gender
csv_table_read %>%
ggplot(aes(sample = points)) +
geom_qq() +
geom_qq_line(colour = "blue") +
facet_grid(~gender)
# mean points per gender - the difference is not significant (p = 0.24)
csv_table_read %>%
t.test(points ~ gender, data = .)
##
## Welch Two Sample t-test
##
## data: points by gender
## t = -1.1832, df = 107.84, p-value = 0.2393
## alternative hypothesis: true difference in means between group F and group M is not equal to 0
## 95 percent confidence interval:
## -3.0896905 0.7799502
## sample estimates:
## mean in group F mean in group M
## 22.32727 23.48214
# attitude vs gender
csv_table_read %>%
ggplot(aes(x=gender, y= attitude)) +
geom_boxplot()
# Type histogram
csv_table_read %>%
ggplot(aes(x = attitude)) +
geom_histogram() +
facet_grid(~ gender) +
labs(title = "Histogram of attitude by Gender")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# QQ plot: attitude per gender
csv_table_read %>%
ggplot(aes(sample = attitude)) +
geom_qq() +
geom_qq_line(colour = "blue") +
facet_grid(~gender)
# mean attitude per gender - this is significant (p < 0.001) and shows a difference between F and M on attitude
csv_table_read %>%
t.test(attitude ~ gender, data = .)
##
## Welch Two Sample t-test
##
## data: attitude by gender
## t = -4.0932, df = 122.66, p-value = 7.657e-05
## alternative hypothesis: true difference in means between group F and group M is not equal to 0
## 95 percent confidence interval:
## -0.6718635 -0.2338508
## sample estimates:
## mean in group F mean in group M
## 2.990000 3.442857
# deep learning approach vs gender
# We could do this for all learning approaches
# Type histogram
csv_table_read %>%
ggplot(aes(x = deep)) +
geom_histogram() +
facet_grid(~ gender) +
labs(title = "Histogram of deep approach by Gender")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#Type boxplot
csv_table_read %>%
ggplot(aes(y = deep, x = gender, fill = gender)) +
geom_boxplot() +
labs(title = "Boxplot of deep Approach by Gender")
# QQ plot: deep per gender
csv_table_read %>%
ggplot(aes(sample = deep)) +
geom_qq() +
geom_qq_line(colour = "blue") +
facet_grid(~gender)
# mean deep per gender - this is not significant (p = 0.47), so there is no clear difference between genders on the deep approach
csv_table_read %>%
t.test(deep ~ gender, data = .)
##
## Welch Two Sample t-test
##
## data: deep by gender
## t = -0.72082, df = 101.32, p-value = 0.4727
## alternative hypothesis: true difference in means between group F and group M is not equal to 0
## 95 percent confidence interval:
## -0.2546963 0.1189279
## sample estimates:
## mean in group F mean in group M
## 3.656818 3.724702
# age does not seem to affect points
csv_table_read %>%
ggplot(aes(x= age, y=points)) +
geom_point()+
facet_wrap(~ gender) +
geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'
# age does not seem to affect attitude
csv_table_read %>%
ggplot(aes(x= age, y=attitude)) +
geom_point() +
facet_wrap(~ gender) +
geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'
# deep learning approach vs age - no correlation
# We could do this for all learning approaches
csv_table_read %>%
ggplot(aes(x= age, y= deep)) +
geom_point() +
facet_wrap(~ gender) +
geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'
# deep learning approach vs points - the plot hints at a relationship between the deep approach and points
csv_table_read %>%
ggplot(aes(x= deep, y=points)) +
geom_point() +
facet_wrap(~ gender) +
geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'
test_fit <- csv_table_read %>%
lm(points ~ deep * gender, data = .)
library(ggfortify)
summary(test_fit)
##
## Call:
## lm(formula = points ~ deep * gender, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.3247 -3.3338 0.3369 4.6242 10.6787
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.36387 3.91828 5.708 5.33e-08 ***
## deep -0.01001 1.06032 -0.009 0.992
## genderM 2.67476 6.41719 0.417 0.677
## deep:genderM -0.40787 1.71487 -0.238 0.812
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.922 on 162 degrees of freedom
## Multiple R-squared: 0.00922, Adjusted R-squared: -0.009127
## F-statistic: 0.5025 on 3 and 162 DF, p-value: 0.6811
# deep does not have a significant effect on "points" (all p values are large).
setwd("/Users/margot/Desktop/Desktop - MacBook Pro de MARGOT/Open data with R 2023/IODS-project")
library(broom)
library(ggplot2)
library(dplyr)
data <- read.csv("alc.csv",sep = ",", header = T)
class(data$high_use)
## [1] "logical"
class(data$famrel)
## [1] "integer"
class(data$goout)
## [1] "integer"
class(data$studytime)
## [1] "integer"
The purpose of your analysis is to study the relationships between high/low alcohol consumption and some of the other variables in the data. To do this, choose 4 interesting variables in the data and for each of them, present your personal hypothesis about their relationships with alcohol consumption. (0-1 point)
Comments on the code chunk below: There is a negative association between the quality of family relationships and the consumption of alcohol: when the relationship is poor (scored from 1 to 5), the risk of high alcohol consumption is greater. The p-value is 0.0204, which is significant (<0.05). The coefficient for famrel in this model is -0.2838, so the odds ratio associated with a one-unit change in famrel is approximately exp(-0.2838) = 0.753: for a one-unit increase in famrel, the odds of high_use decrease by about 1 - 0.753 = 24.7%. With 95% confidence, the true odds ratio for the intercept lies between approximately 0.50 and 3.35, and for famrel between 0.59 and 0.96; since this interval does not contain 1, famrel has a statistically significant, negative effect on high_use. The null deviance is larger than the residual deviance of the model, which shows that the model explains part of the variation in high_use.
# I add family = binomial because the response variable is binary (TRUE/FALSE)
logistic_model_family_alcohol <- glm(high_use ~ famrel, data = data, family = binomial)
summary(logistic_model_family_alcohol)
##
## Call:
## glm(formula = high_use ~ famrel, family = binomial, data = data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.2573 0.4853 0.530 0.5960
## famrel -0.2838 0.1224 -2.318 0.0204 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 446.67 on 368 degrees of freedom
## AIC: 450.67
##
## Number of Fisher Scoring iterations: 4
# get the coef
coef(logistic_model_family_alcohol) %>% exp()
## (Intercept) famrel
## 1.2934265 0.7528865
# odds of being in high use - exponential of the coefficients (same computation as above, written differently)
exp(coef(logistic_model_family_alcohol))
## (Intercept) famrel
## 1.2934265 0.7528865
# get the intervals
confint(logistic_model_family_alcohol) %>% exp()
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) 0.4961949 3.3547129
## famrel 0.5911258 0.9570208
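As a sanity check on the odds-ratio arithmetic above, here is a minimal sketch using the famrel estimate reported in the model summary (the value is copied from the output, not recomputed from the data):

```r
# Convert the famrel coefficient to an odds ratio and a percent change in odds
beta_famrel <- -0.2838                 # estimate from the model summary above
odds_ratio  <- exp(beta_famrel)        # ~0.753
pct_change  <- (1 - odds_ratio) * 100  # ~24.7% decrease per one-unit increase in famrel
round(c(odds_ratio = odds_ratio, pct_decrease = pct_change), 3)
```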
# get the variance, BIC and AIC
logistic_model_family_alcohol %>%
glance()
## # A tibble: 1 × 8
## null.deviance df.null logLik AIC BIC deviance df.residual nobs
## <dbl> <int> <dbl> <dbl> <dbl> <dbl> <int> <int>
## 1 452. 369 -223. 451. 458. 447. 368 370
# To check the impact of high_use on family relationship: famrel is ordinal, so I use an ordinal regression model
library(ordinal)
##
## Attaching package: 'ordinal'
## The following object is masked from 'package:dplyr':
##
## slice
ordinal_model_family_alcohol <- clm(factor(famrel) ~ high_use, data = data)
summary(ordinal_model_family_alcohol)
## formula: factor(famrel) ~ high_use
## data: data
##
## link threshold nobs logLik AIC niter max.grad cond.H
## logit flexible 370 -454.70 919.40 5(0) 3.77e-08 2.0e+01
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## high_useTRUE -0.5411 0.2139 -2.529 0.0114 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Threshold coefficients:
## Estimate Std. Error z value
## 1|2 -4.0079 0.3672 -10.914
## 2|3 -2.7746 0.2194 -12.645
## 3|4 -1.3118 0.1422 -9.222
## 4|5 0.8468 0.1300 6.516
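The clm coefficient can also be read on the odds scale; a quick sketch using the high_useTRUE estimate reported above (value copied from the summary):

```r
# exp of the high_useTRUE coefficient from the clm summary above:
# students with high_use == TRUE have ~42% lower odds of being in a
# higher famrel category than students with high_use == FALSE
exp(-0.5411)
```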
# This box plot supports our hypothesis and conclusion: the distribution of family relationship quality is clearly different depending on whether the students drink a lot or not.
data %>%
ggplot(aes(x= factor(high_use), y= famrel)) +
geom_boxplot() +
labs(title = "Quality of family relationship vs the alcohol consumption",
x = "High consumption",
y = "quality of family relationship")
data %>%
ggplot(aes(x = as.factor(famrel), fill = as.factor(high_use))) +
geom_bar(position = "fill") +
labs(
title = "Relationship Between Family relationship and Alcohol Consumption",
x = "Family relation score from 1 to 5",
y = "Proportion of students",
fill = "High Use of alcohol: True or False"
)
# there are fewer scores of 5 and many more scores of 2 among the students who drink a lot
data %>%
ggplot(aes(x = as.factor(high_use), fill = as.factor(famrel))) +
geom_bar(position = "fill") +
labs(
title = "Family relation score distribution by Alcohol Consumption",
x = "High Use of alcohol: True or False",
y = "Proportion of students",
fill = "Family relation score from 1 to 5"
)
Comments on the code chunk below: There is a positive relationship between going out and the consumption of alcohol, with a very small and significant p-value, which confirms the hypothesis. The more students go out, the higher their risk of high alcohol consumption.
The coefficient for goout is 0.7563, so the odds ratio associated with a one-unit change in goout is approximately exp(0.7563) = 2.13: for a one-unit increase in goout, the odds of high_use roughly double (an increase of about 113%).
With 95% confidence, the odds ratio for the intercept lies between approximately 0.015 and 0.079, and for goout between 1.71 and 2.70; since this interval is entirely above 1, goout has a statistically significant, positive effect on high_use.
The null deviance is much larger than the residual deviance, so the model explains part of the variation in high_use, even better than in hypothesis 1. The AIC of this model is also smaller than that of the previous model, which supports this conclusion.
logistic_model_goout_alcohol <- glm(high_use ~ goout, data = data, family = binomial)
summary(logistic_model_goout_alcohol)
##
## Call:
## glm(formula = high_use ~ goout, family = binomial, data = data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.3368 0.4188 -7.968 1.62e-15 ***
## goout 0.7563 0.1165 6.494 8.34e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 403.05 on 368 degrees of freedom
## AIC: 407.05
##
## Number of Fisher Scoring iterations: 4
tidy(logistic_model_goout_alcohol)
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -3.34 0.419 -7.97 1.62e-15
## 2 goout 0.756 0.116 6.49 8.34e-11
# get the coef
coef(logistic_model_goout_alcohol) %>% exp()
## (Intercept) goout
## 0.03554946 2.13048020
# get the exp of the coef
exp(coef(logistic_model_goout_alcohol))
## (Intercept) goout
## 0.03554946 2.13048020
# get the intervals
confint(logistic_model_goout_alcohol) %>% exp()
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) 0.01515176 0.07850224
## goout 1.70566958 2.69518608
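broom can produce this odds-ratio table (estimate plus 95% CI on the odds scale) in a single call via the `exponentiate` and `conf.int` arguments of `tidy()`. A sketch on simulated data, since the same call should work on the real model; the toy data and names here are hypothetical:

```r
library(broom)
set.seed(1)
# toy data imitating the structure of the real variables (hypothetical)
toy <- data.frame(goout = sample(1:5, 400, replace = TRUE))
toy$high_use <- rbinom(400, 1, plogis(-3 + 0.75 * toy$goout))
toy_model <- glm(high_use ~ goout, data = toy, family = binomial)
# one call gives exponentiated estimates with 95% confidence intervals
tidy(toy_model, exponentiate = TRUE, conf.int = TRUE)
```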
# plots
data %>%
ggplot(aes(x= factor(high_use), y= goout)) +
geom_boxplot() +
labs(title = "Going out with friends vs the alcohol consumption",
x = "High consumption",
y = "go out with friends (from 1 to 5)")
# students who drink a lot are more numerous among those who gave scores of 4 and 5 for the goout variable.
data %>%
ggplot(aes(x = factor(goout), fill = factor(high_use))) +
geom_bar(position = "fill") +
labs(
title = "Relationship Between Going Out Behavior and Alcohol Consumption",
x = "going out with friends from 1 to 5",
y = "Proportion of students",
fill = "High Use of alcohol: True or False"
)
# scores of 1, 2 and 3 are more common among the students who do not drink a lot
data %>%
ggplot(aes(x = factor(high_use), fill = factor(goout))) +
geom_bar(position = "fill") +
labs(
title = "Relationship Between Alcohol Consumption and going out behavior",
x = "High Use of alcohol: True or False",
y = "Proportion of students",
fill = "going out with friends from 1 to 5"
)
Comments about the code chunk below: The p-value is very small and the estimate is negative, so we can conclude there is a negative association between study time and alcohol consumption: when study time increases, the odds of high alcohol consumption decrease.
The coefficient for studytime is -0.6208, so the odds ratio associated with a one-unit change in studytime is approximately exp(-0.6208) = 0.538: for a one-unit increase in studytime, the odds of high_use decrease by about 1 - 0.538 = 46.2%.
With 95% confidence, the odds ratio for the intercept lies between approximately 0.79 and 2.67, and for studytime between 0.39 and 0.72; since this interval is entirely below 1, studytime has a statistically significant, negative effect on high_use.
The residual deviance is smaller than the null deviance, which means the model explains part of the variation in high_use. The AIC is smaller than in the first model but not as small as in the second.
logistic_model_studytime_alcohol <- glm(high_use ~ studytime, data = data, family = binomial)
summary(logistic_model_studytime_alcohol)
##
## Call:
## glm(formula = high_use ~ studytime, family = binomial, data = data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.3649 0.3102 1.176 0.239
## studytime -0.6208 0.1542 -4.027 5.66e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 433.82 on 368 degrees of freedom
## AIC: 437.82
##
## Number of Fisher Scoring iterations: 4
tidy(logistic_model_studytime_alcohol)
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.365 0.310 1.18 0.239
## 2 studytime -0.621 0.154 -4.03 0.0000566
# get the coef
coef(logistic_model_studytime_alcohol) %>% exp()
## (Intercept) studytime
## 1.4403849 0.5375316
# Get the odds - exponential of the coefficients:
exp(coef(logistic_model_studytime_alcohol))
## (Intercept) studytime
## 1.4403849 0.5375316
# get the intervals
confint(logistic_model_studytime_alcohol) %>% exp()
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) 0.7881099 2.6654451
## studytime 0.3933727 0.7208174
# This box plot shows the difference in the study time distribution between students who drink a lot and those who do not
data %>%
ggplot(aes(x= factor(high_use), y= studytime)) +
geom_boxplot() +
labs(title = "Study time vs the alcohol consumption",
x = "High consumption",
y = "Study time (from 1 to 4)")
# students who drink a lot are much more numerous among those who rated their study time 1 or 2
data %>%
ggplot(aes(x = factor(studytime), fill = factor(high_use))) +
geom_bar(position = "fill") +
labs(
title = "Relationship Between Study time Behavior and Alcohol Consumption",
x = "Study time from 1 to 4",
y = "Proportion of students",
fill = "High Use of alcohol: True or False"
)
# many more scores of 1 and far fewer scores of 3 and 4 among students who drink a lot.
data %>%
ggplot(aes(x = factor(high_use), fill = factor(studytime))) +
geom_bar(position = "fill") +
labs(
title = "Relationship Between Alcohol Consumption and Study time behavior",
x = "High Use of alcohol: True or False",
y = "Proportion of students",
fill = "Study time from 1 to 4"
)
Comments about the code below: The p-value is 0.022, which indicates an association between the reason for choosing a school and the consumption of alcohol.
The test alone does not tell us which reason is most often chosen by the students who drink a lot of alcohol versus the ones who don't; some plots can help us identify it (check below).
# Creating a contingency table
chi_table_alcohol_reasons <- table((data$high_use), as.factor(data$reason))
# Performing the chi-squared test
chi_square_test <- chisq.test(chi_table_alcohol_reasons )
chi_square_test
##
## Pearson's Chi-squared test
##
## data: chi_table_alcohol_reasons
## X-squared = 9.563, df = 3, p-value = 0.02267
# Summary of the test
summary(chi_square_test)
## Length Class Mode
## statistic 1 -none- numeric
## parameter 1 -none- numeric
## p.value 1 -none- numeric
## method 1 -none- character
## data.name 1 -none- character
## observed 8 table numeric
## expected 8 -none- numeric
## residuals 8 table numeric
## stdres 8 table numeric
# These plots show that students who don't drink a lot chose their school for its reputation more often than those who drink a lot. For the ones who drink, the reasons home and other are much more important than for the students who don't drink too much. The reason courses seems equally important for both groups.
data %>%
ggplot(aes(x = high_use, fill = reason)) +
geom_bar(position = "fill") +
labs(
title = "Relationship Between Reason to choose their school and Alcohol Consumption",
x = "High Consumption of alcohol",
y = "Proportion of students",
fill = "Reason to choose the school"
)
data %>%
ggplot(aes(x = reason, fill = high_use)) +
geom_bar(position = "fill") +
labs(
title = "Relationship Between Alcohol Consumption and Reason to choose their school",
x = "High Use of alcohol: True or False",
y = "Proportion of students",
fill = "Reason to choose the school"
)
By including the three variables together, we can see that the variation in high_use is better explained (the residual deviance decreases from 452 to 380).
All three variables (goout, famrel, and studytime) are statistically significant predictors of alcohol consumption (high_use).
logistic_model_3var <- glm(high_use ~ goout + famrel + studytime, data = data, family = binomial)
logistic_model_3var_2 <- glm(high_use ~ goout * famrel * studytime, data = data, family = binomial)
summary(logistic_model_3var)
##
## Call:
## glm(formula = high_use ~ goout + famrel + studytime, family = binomial,
## data = data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.7286 0.6941 -1.050 0.29381
## goout 0.7777 0.1203 6.467 9.99e-11 ***
## famrel -0.3777 0.1367 -2.763 0.00573 **
## studytime -0.6139 0.1672 -3.673 0.00024 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 379.74 on 366 degrees of freedom
## AIC: 387.74
##
## Number of Fisher Scoring iterations: 4
tidy(logistic_model_3var)
## # A tibble: 4 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -0.729 0.694 -1.05 2.94e- 1
## 2 goout 0.778 0.120 6.47 9.99e-11
## 3 famrel -0.378 0.137 -2.76 5.73e- 3
## 4 studytime -0.614 0.167 -3.67 2.40e- 4
# Anova conclusion: the p-value for the difference in deviance (Pr(>Chi)) is 0.1362, so the difference in fit between the two models is not statistically significant and the simpler additive model is preferred
anova(logistic_model_3var, logistic_model_3var_2, test="LRT")
## Analysis of Deviance Table
##
## Model 1: high_use ~ goout + famrel + studytime
## Model 2: high_use ~ goout * famrel * studytime
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 366 379.74
## 2 362 372.74 4 6.995 0.1362
# when comparing with, for instance, logistic_model_family_alcohol, the residual deviance of logistic_model_3var is lower and the difference is significant.
anova(logistic_model_3var, logistic_model_family_alcohol, test="LRT")
## Analysis of Deviance Table
##
## Model 1: high_use ~ goout + famrel + studytime
## Model 2: high_use ~ famrel
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 366 379.74
## 2 368 446.67 -2 -66.934 2.921e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# the best model is logistic_model_3var according to AIC and BIC, because both metrics are smaller for this model
AIC(logistic_model_3var, logistic_model_3var_2)
## df AIC
## logistic_model_3var 4 387.7372
## logistic_model_3var_2 8 388.7422
BIC(logistic_model_3var, logistic_model_3var_2)
## df BIC
## logistic_model_3var 4 403.3912
## logistic_model_3var_2 8 420.0502
Using the variables which, according to your logistic regression model, had a statistical relationship with high/low alcohol consumption, explore the predictive power of your model. Provide a 2x2 cross tabulation of predictions versus the actual values and optionally display a graphic visualizing both the actual values and the predictions. Compute the total proportion of inaccurately classified individuals (= the training error) and comment on all the results. Compare the performance of the model with performance achieved by some simple guessing strategy.
# prediction based on the model: logistic_model_3var
predictions <- predict(logistic_model_3var, type = "response")
#Create the confusion matrix:
table(data$high_use, predictions > 0.5)
##
## FALSE TRUE
## FALSE 235 24
## TRUE 63 48
# correct: FALSE/FALSE + TRUE/TRUE = 235 + 48 = 283
# wrong: FALSE/TRUE + TRUE/FALSE = 24 + 63 = 87
# probability of misclassification = 87/370 = 23.5%
# odds of misclassification = 87/283 = 0.31
# precision: true positives among all positive predictions = 48/72 = 0.67
precision <- 48 / (48 + 24)
# recall: true positives among all actual positives = 48/111 = 0.43
recall <- 48 / (48 + 63)
# F1 score: 0.52
2 * (precision * recall) / (precision + recall)
## [1] 0.5245902
# Accuracy of always predicting the majority class: if the majority class is "FALSE," then predicting "FALSE" for every instance would result in a correct prediction rate of 70%
mean(data$high_use == FALSE)
## [1] 0.7
# The total proportion of errors is 23.5%, so my model does a bit better than always predicting the majority class (70% accuracy, i.e. 30% error).
mean(data$high_use != (predictions > 0.5))
## [1] 0.2351351
# to conclude, my model is a bit better than always predicting the majority class. But it should be improved for better accuracy!
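The metrics above can be wrapped in a small helper so the arithmetic is reusable; a sketch (the function name is my own, and the counts are the ones from the confusion matrix above):

```r
# Classification metrics from the four cells of a 2x2 confusion matrix
# (rows = actual, cols = predicted)
conf_metrics <- function(tn, fp, fn, tp) {
  n <- tn + fp + fn + tp
  c(accuracy  = (tp + tn) / n,   # share of correct predictions
    precision = tp / (tp + fp),  # correct among predicted positives
    recall    = tp / (tp + fn))  # correct among actual positives
}
round(conf_metrics(tn = 235, fp = 24, fn = 63, tp = 48), 3)
```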
I had to rely fully on ChatGPT for this one; I did not know the method. I now understand that it splits the data into 10 folds and uses 9 of them to fit the model and the remaining one to test it, rotating so that every fold is tested once. In this case, the average accuracy over the folds is 74.3%, which is better than the 70% of the majority class.
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
library(lattice)
# Define the model formula
formula <- factor(high_use) ~ goout + famrel + studytime
# Define the number of folds for cross-validation
num_folds <- 10
# Create a training control with 10-fold cross-validation
train_control <- trainControl(method = "cv", number = num_folds)
# Train the model using 10-fold cross-validation
model <- train(formula, data = data, method = "glm", family = binomial, trControl = train_control)
# Get cross-validated results
results <- model$resample
str(results)
## 'data.frame': 10 obs. of 3 variables:
## $ Accuracy: num 0.811 0.784 0.757 0.73 0.73 ...
## $ Kappa : num 0.479 0.345 0.287 0.232 0.317 ...
## $ Resample: chr "Fold01" "Fold02" "Fold03" "Fold04" ...
# Calculate mean prediction error
mean(results$Accuracy)
## [1] 0.7430575
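To demystify what caret does here, this is a minimal manual 10-fold cross-validation sketch on simulated data (the data and names are hypothetical, for illustration only):

```r
set.seed(42)
# simulated binary outcome driven by one predictor
n  <- 300
df <- data.frame(x = rnorm(n))
df$y <- rbinom(n, 1, plogis(df$x))
# assign each row to one of 10 folds
folds <- sample(rep(1:10, length.out = n))
# for each fold: fit on the other 9 folds, predict on the held-out fold
fold_accuracy <- sapply(1:10, function(k) {
  fit  <- glm(y ~ x, data = df[folds != k, ], family = binomial)
  pred <- predict(fit, newdata = df[folds == k, ], type = "response") > 0.5
  mean(pred == (df$y[folds == k] == 1))
})
mean(fold_accuracy)  # cross-validated accuracy
```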
The Boston research seems to look for correlations between the standard of living, housing quality, and other criteria such as the urban environment, accessibility, and demographics within the Boston area.
Each row represents data for a specific town and has information about various factors such as crime rate, proportion of non-retail business acres, property tax rates, pupil-teacher ratios, etc.
This data frame contains the following columns:
crim - per capita crime rate by town.
zn - proportion of residential land zoned for lots over 25,000 sq.ft.
indus - proportion of non-retail business acres per town.
chas - Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
nox - nitrogen oxides concentration (parts per 10 million).
rm - average number of rooms per dwelling.
age - proportion of owner-occupied units built prior to 1940.
dis - weighted mean of distances to five Boston employment centres.
rad - index of accessibility to radial highways.
tax - full-value property-tax rate per $10,000.
ptratio - pupil-teacher ratio by town.
black - 1000(Bk − 0.63)^2, where Bk is the proportion of Black residents by town.
lstat - lower status of the population (percent).
medv - median value of owner-occupied homes in $1000s.
Explore the structure and the dimensions of the dataset. Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, the distributions of the variables and the relationships between them.
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(dplyr)
library(ggplot2)
library(GGally)
library(tidyr)
setwd("/Users/margot/Desktop/Desktop - MacBook Pro de MARGOT/Open data with R 2023/IODS-project")
Boston <- Boston
# explore the mean, median, min, max per variable
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
# 506 entries with 14 variables
dim(Boston)
## [1] 506 14
# Data type: all num or int
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
# let's create a scatter plot for each pair of variables -> not easy to read, as there are too many variables
pairs(Boston)
# Let's analyze the histogram of each variable. To use facet_wrap, which draws one chart per key, we need a table with 2 columns: key (the variable name) and value (the metric).
gather(Boston) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Let's compute a correlation matrix
matrix_correlation_var <- cor(Boston)
# Visualize correlation matrix as a heatmap
library(reshape2)
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
ggplot(data = melt(matrix_correlation_var), aes(Var1, Var2, fill = value)) +
geom_tile() +
scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0)
According to the matrix, there are negative relationships between certain variables:
lstat and medv - lower status of population and the median price of homes owned by occupants
lstat and rm - lower status of population and the average number of rooms per home
tax and medv - the percentage at which a property is taxed based on its assessed value and the median price of homes owned by occupants
dis and lstat - the weighted average distance from each house to these employment centers and the lower status of population
dis and age - the weighted average distance from each house to these employment centers and the percentage or fraction of homes that are owner-occupied and were constructed before the year 1940.
dis and nox - the weighted average distance from each house to these employment centers and nitrogen oxides concentration.
dis and indus - the weighted average distance from each house to these employment centers and the proportion of non-retail business acres per town (manufacturing facilities, industrial parks, office spaces, warehouses etc).
tax and dis - the percentage at which a property is taxed based on its assessed value and the weighted average distance from each house to these employment centers
zn and age - the percentage or fraction of land within residential areas that is for individual lots with a minimum size of over 25,000 square feet and the percentage or fraction of homes that are owner-occupied and were constructed before the year 1940
zn and nox - the percentage or fraction of land within residential areas that is for individual lots with a minimum size of over 25,000 square feet and the nitrogen oxides concentration.
zn and indus - the percentage or fraction of land within residential areas that is for individual lots with a minimum size of over 25,000 square feet and the proportion of non-retail business acres per town (manufacturing facilities, industrial parks, office spaces, warehouses etc).
Positive relationships between variables
medv and rm - the median price of homes owned by occupants and the average number of rooms per home
tax and indus - the percentage at which a property is taxed based on its assessed value and the proportion of non-retail business acres per town.
tax and nox - the percentage at which a property is taxed based on its assessed value and the nitrogen oxides concentration.
age and indus - the percentage or fraction of homes that are owner-occupied and were constructed before the year 1940 and the proportion of non-retail business acres per town.
age and nox- the percentage or fraction of homes that are owner-occupied and were constructed before the year 1940 and the nitrogen oxides concentration.
nox and indus - the nitrogen oxides concentration and the proportion of non-retail business acres per town.
-> variables to test in my model: lstat, medv, rm, tax, dis, age, nox, indus, zn
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
model1 <- glm(lstat ~ medv + dis + rm,data=Boston)
model2 <- glm(medv ~ rm + tax + lstat ,data=Boston)
# all the VIF are between 1 and 2.1, which is OK. It suggests low multicollinearity and implies that the variance of the estimated coefficients is not significantly inflated by collinearity
vif(model1)
## medv dis rm
## 1.982257 1.068810 1.940168
vif(model2)
## rm tax lstat
## 1.610953 1.426005 2.092901
# let's calculate the exponentiated coefficients and confidence intervals (95%). Note: these are Gaussian models, not logistic ones, so strictly speaking these are not odds ratios:
OR1 <- coef(model1) %>% exp
OR2 <- coef(model2) %>% exp
CI1 <- confint(model1) %>% exp
## Waiting for profiling to be done...
CI2 <- confint(model2) %>% exp
## Waiting for profiling to be done...
# an exponentiated confidence interval that doesn't span 1 corresponds to a coefficient interval that doesn't span 0 = a statistically significant effect in both models
cbind(OR1, CI1)
## OR1 2.5 % 97.5 %
## (Intercept) 1.764803e+16 4.023272e+14 7.741285e+17
## medv 6.606608e-01 6.250484e-01 6.983022e-01
## dis 3.292580e-01 2.756488e-01 3.932934e-01
## rm 1.682751e-01 8.210665e-02 3.448749e-01
cbind(OR2, CI2)
## OR2 2.5 % 97.5 %
## (Intercept) 0.6073487 0.001289619 286.0319984
## rm 181.1868455 76.546116788 428.8744403
## tax 0.9935198 0.990167804 0.9968832
## lstat 0.5754723 0.522466602 0.6338557
# the residual deviance is way smaller than the null deviance. It indicates a reasonably good fit of the model to the data.
summary(model1)
##
## Call:
## glm(formula = lstat ~ medv + dis + rm, data = Boston)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.40940 1.92918 19.391 < 2e-16 ***
## medv -0.41451 0.02827 -14.662 < 2e-16 ***
## dis -1.11091 0.09067 -12.252 < 2e-16 ***
## rm -1.78215 0.36612 -4.868 1.51e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 17.22406)
##
## Null deviance: 25752.4 on 505 degrees of freedom
## Residual deviance: 8646.5 on 502 degrees of freedom
## AIC: 2882.2
##
## Number of Fisher Scoring iterations: 2
# medv and rm also influence each other, so let's add their interaction to the model
model11 <- glm(lstat ~ medv + dis + rm + rm * medv ,data=Boston)
summary(model11)
##
## Call:
## glm(formula = lstat ~ medv + dis + rm + rm * medv, data = Boston)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 74.41052 3.64466 20.42 <2e-16 ***
## medv -1.94316 0.13517 -14.38 <2e-16 ***
## dis -0.89569 0.08285 -10.81 <2e-16 ***
## rm -7.55541 0.59817 -12.63 <2e-16 ***
## medv:rm 0.22526 0.01957 11.51 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 13.64918)
##
## Null deviance: 25752.4 on 505 degrees of freedom
## Residual deviance: 6838.2 on 501 degrees of freedom
## AIC: 2765.5
##
## Number of Fisher Scoring iterations: 2
# same for this one, residual deviance is way smaller than the null deviance.
summary(model2)
##
## Call:
## glm(formula = medv ~ rm + tax + lstat, data = Boston)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.498652 3.140239 -0.159 0.873895
## rm 5.199529 0.439618 11.827 < 2e-16 ***
## tax -0.006501 0.001724 -3.770 0.000182 ***
## lstat -0.552564 0.049302 -11.208 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 29.90866)
##
## Null deviance: 42716 on 505 degrees of freedom
## Residual deviance: 15014 on 502 degrees of freedom
## AIC: 3161.4
##
## Number of Fisher Scoring iterations: 2
Standardizing the data. Variables often have different scales and units, which makes direct comparisons difficult. Standardization brings all variables to a common scale, giving each one a mean of 0 and a standard deviation of 1. This helps in interpreting coefficients and comparing the relative importance of different predictors in regression, ensures that each variable contributes equally to the analysis rather than one variable dominating because of its scale, and can mitigate multicollinearity issues by reducing the impact of differing scales on the regression coefficients.
# select numerical values
Boston_numeric_cols <- Boston[, sapply(Boston, is.numeric)]
# scale() standardizes the data to a mean of 0 and a standard deviation of 1.
scaled_boston <- scale(Boston_numeric_cols)
# convert to a data frame
scaled_table_boston <- as.data.frame(scaled_boston)
# how did the data change? The mean of every variable is now 0, so the scaling worked.
summary(scaled_table_boston)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
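As a quick sanity check (a small sketch of my own, not part of the exercise), the output of scale() should have column means of (numerically) zero and standard deviations of exactly one:

```r
library(MASS)  # for the Boston dataset

# standardize all numeric columns of Boston
scaled <- scale(Boston[, sapply(Boston, is.numeric)])
# every column mean is ~0 and every standard deviation is 1
all(abs(colMeans(scaled)) < 1e-10)
all(abs(apply(scaled, 2, sd) - 1) < 1e-10)
```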
# use the cut() function to create a categorical variable based on intervals (breaks) in a numerical variable. Here we cut the crim variable at its quartiles (probabilities 0, 0.25, 0.50, 0.75, 1) and add a label for each category.
# include.lowest = TRUE ensures the minimum value falls into the first category instead of becoming NA.
quantiles <- quantile(Boston$crim, probs = seq(0, 1, by = 0.25), na.rm = TRUE)
interval_labels <- c("premier_quantil", "second_quantil", "third_quantil", "fourth_quantil")
scaled_table_boston$quantiles_crime <- cut(Boston$crim, quantiles, labels= interval_labels,include.lowest = TRUE)
# Notes: quantiles derived from numeric values can be treated as categorical or continuous. When quantiles define discrete bins that divide a continuous variable into distinct groups, they are treated as categorical (e.g., small, medium, big). When quantiles are the numeric values themselves, indicating a position in the distribution of a continuous variable (e.g., the 25th percentile, median, 75th percentile), they are considered continuous.
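To make the quartile-binning idea concrete, here is a minimal sketch on a toy vector (made-up data, not the Boston variable):

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8)  # toy numeric vector
# quartile break points of x
brks <- quantile(x, probs = seq(0, 1, by = 0.25))
# cut x into four labeled bins; include.lowest keeps the minimum out of NA
cats <- cut(x, brks, labels = c("Q1", "Q2", "Q3", "Q4"), include.lowest = TRUE)
table(cats)  # each quartile bin holds two values
```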
# drop the former column crim and create a new table.
Boston_new <- scaled_table_boston %>% dplyr::select(-crim)
# We want 80% of the total number of rows for the training set
train_size <- round(0.8 * nrow(Boston_new))
# Randomly sample row indices for the training set
train_set <- sample(seq_len(nrow(Boston_new)), size = train_size)
# Create training and testing subsets
train_data <- Boston_new[train_set, ]
test_data <- Boston_new[-train_set, ]
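One detail worth noting (my own addition, not required by the exercise): sample() draws a different split on every knit, so calling set.seed() first makes the split reproducible. A minimal sketch, using 506 as the Boston row count and an arbitrary seed:

```r
set.seed(123)  # arbitrary fixed seed, chosen for illustration
n <- 506       # number of rows in the Boston data used above
idx1 <- sample(seq_len(n), size = round(0.8 * n))
set.seed(123)  # same seed again ...
idx2 <- sample(seq_len(n), size = round(0.8 * n))
identical(idx1, idx2)  # ... gives the identical 405-row training split
```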
Fit the linear discriminant analysis on the train set. Use the categorical crime rate as the target variable and all the other variables in the dataset as predictor variables. Draw the LDA (bi)plot.
Linear Discriminant Analysis (LDA) can be quite useful in various real-life scenarios where classification or dimensionality reduction is required, such as marketing and customer segmentation, product quality control, and credit scoring.
It is well suited when we have multiple variables that can help us assign each observation to one specific group, under the assumption that observations within a group share similar means and variances. The goal is to predict the categorical outcome or class based on a set of predictors.
LDA transforms the original variables into a smaller set of new variables, known as linear discriminants. These new variables are constructed to maximize the separation between the different categories or classes in the data. By using the information captured in the linear discriminants, LDA can accurately assign new observations (data points) to their respective classes. In short, LDA combines the variables in a way that is informative about group differences, and then uses that combination to categorize new data. This makes it useful both for simplifying data and for making predictions about categories in that data.
LDA calculates the mean and variance for each predictor within each class or category in the target variable. It finds coefficients for these combinations by considering differences in means between classes and assumes equal variances for each predictor within classes. It assumes that the data approximately follows a normal distribution.
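To see these ideas on a small, familiar dataset before applying them to Boston, here is a sketch using the built-in iris data (my own example, not part of the exercise). With 3 classes and 4 predictors, LDA produces min(3 - 1, 4) = 2 linear discriminants:

```r
library(MASS)  # provides lda()

# fit LDA with Species as the target and all four measurements as predictors
fit <- lda(Species ~ ., data = iris)
ncol(fit$scaling)                    # 2 linear discriminants
pred <- predict(fit)                 # in-sample predictions
mean(pred$class == iris$Species)     # resubstitution accuracy (high, ~0.98)
```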
# check the impact of each variable on a categorical variable (quantiles_crime)
# each quantile category contains approximately 25% of the observations
# LD1 explains 96% of the between-group variance (proportion of trace). When LD1 captures 96%, it suggests that LD1 alone separates the classes well based on their mean differences. Along LD1, observations are projected so that the separation between classes is maximized while the variation within each class is minimized.
# Positive coefficients indicate that higher values of that variable contribute to a higher score in that particular discriminant, while negative coefficients suggest the opposite. The larger the magnitude of the coefficient, the greater the impact of that variable on the discriminant.
library(MASS)
# Fit LDA on the train set - LDA is a statistical method used for classification. It fits an LDA model using the train_data, where quantiles_crime is the target variable to predict based on the other variables
lda_model <- lda(quantiles_crime ~ ., data = train_data)
# Predict on the test set by using the function predict to apply the model created from the train_data to test_data -> the idea here is to predict the class labels for this test_data
lda_pred <- predict(lda_model, test_data)$class # class extract the predicted class labels
# actual crime quantiles
actual_crime_categories <- test_data$quantiles_crime
# Extract LD scores from the LDA model's predictions
lda_scores <- predict(lda_model, test_data)$x # $x accesses the matrix of linear discriminant scores (LD1, LD2, LD3) for the observations in test_data; the posterior probabilities would be in $posterior
# Create a dataframe with LD scores from first 2 columns of the lda_score and the predicted classes. Combining LD1, LD2, and the predicted classes for visualization. In many cases, visualizing beyond LD1 and LD2 might become complex to display effectively in two-dimensional plots. LD1 and LD2 are typically chosen for visualization as they capture the most discrimination power while allowing for a clearer visualization on a 2D plot.
plot_data <- data.frame(LD1 = lda_scores[, 1], LD2 = lda_scores[, 2], prediction_crime_quantile = as.factor(lda_pred))
# show only the first rows - the full table has 101 test observations
head(plot_data)
##           LD1          LD2 prediction_crime_quantile
## 11 -1.7420309 -0.563988780            second_quantil
## 18 -1.8844395 -0.697588155            second_quantil
## 32 -1.7989232 -0.641298376             third_quantil
## 41 -3.3024128  2.332770951           premier_quantil
## 51 -2.6872906  0.702647342            second_quantil
## 63 -0.9375197  1.032234362            second_quantil
# Create a scatterplot of LD1 and LD2
plot_LDA <- ggplot(plot_data, aes(x = LD1, y = LD2, color = prediction_crime_quantile)) +
geom_point() +
labs(title = "LDA Biplot with Predicted Crime Quantiles")
plot_LDA
# adding real values - comparison of actual vs predicted values in test_data
realVsPred_plot <- plot_LDA +
geom_point(aes(color = actual_crime_categories), size = 4, alpha = 0.1) +
labs(color = "Real Quantiles of Crime")
realVsPred_plot
# the accuracy of predictions using test data
accuracy <- mean(lda_pred == test_data$quantiles_crime)
print(paste("Accuracy of LDA model on test data:", round(accuracy * 100, 2), "%"))
## [1] "Accuracy of LDA model on test data: 77.23 %"
Save the crime categories from the test set and then remove the categorical crime variable from the test dataset. Then predict the classes with the LDA model on the test data. Cross tabulate the results with the crime categories from the test set. Comment on the results. (0-3 points)
# save the crime data:
actual_crime_categories <- test_data$quantiles_crime
# Remove the categorical crime variable from the test dataset
test_data_without_crime <- test_data %>% dplyr::select(-quantiles_crime)
# get the classes based on the model - this repeats what was already done earlier with lda_pred, so I am a bit confused by the instructions.
lda_pred_test <- predict(lda_model, test_data_without_crime)$class
# cross-tabulate predicted vs actual - the two tables below are identical because I computed the same predictions twice. I might have missed something in the requested steps.
cross_tab <- table(Predicted = lda_pred_test, Actual = actual_crime_categories)
cross_tab
## Actual
## Predicted premier_quantil second_quantil third_quantil fourth_quantil
## premier_quantil 15 3 0 0
## second_quantil 8 14 4 0
## third_quantil 2 4 17 1
## fourth_quantil 0 0 1 32
cross_table <- table(Predicted = lda_pred, Actual = actual_crime_categories)
cross_table
## Actual
## Predicted premier_quantil second_quantil third_quantil fourth_quantil
## premier_quantil 15 3 0 0
## second_quantil 8 14 4 0
## third_quantil 2 4 17 1
## fourth_quantil 0 0 1 32
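The cross-tabulation can also be read class by class: dividing the diagonal by the column totals gives the fraction of each actual quantile that was recovered (a small sketch of my own, re-entering the printed counts by hand):

```r
# the cross-tabulated counts printed above, rows = predicted, cols = actual
ct <- matrix(c(15,  3,  0,  0,
                8, 14,  4,  0,
                2,  4, 17,  1,
                0,  0,  1, 32), nrow = 4, byrow = TRUE)
recall <- diag(ct) / colSums(ct)  # per-class recall
round(recall, 2)                  # fourth_quantil is recovered almost perfectly
sum(diag(ct)) / sum(ct)           # 78 / 101, the 77.23 % accuracy reported above
```

The extreme quantiles are classified well (especially fourth_quantil, 32 of 33), while most confusion happens between adjacent middle quantiles.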
Reload the Boston dataset and standardize the dataset (we did not do this in the Exercise Set, but you should scale the variables to get comparable distances). Calculate the distances between the observations. Run k-means algorithm on the dataset. Investigate what is the optimal number of clusters and run the algorithm again. Visualize the clusters (for example with the pairs() or ggpairs() functions, where the clusters are separated with colors) and interpret the results. (0-4 points)
# Reload the Boston dataset
Boston_load <- Boston
# scale the Boston data
Boston_scaled <- scale(Boston_load)
# Calculate the distances between the scaled data points:
distances <- dist(Boston_scaled)
# Visualize the distances using fviz_dist()
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
fviz_dist(distances, gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))
# k-means: the k-means algorithm groups observations into clusters based on their similarity, aiming to minimize the variation within clusters while maximizing the variation between clusters. We need to choose the number of clusters before running the k-means analysis.
# Elbow method to find the optimal number of clusters: calculate the within-cluster sum of squared errors (WSS) for a range of values of k, and choose the k at which the decrease in WSS starts to level off. In the plot of WSS versus k, this is visible as an elbow.
# The squared error for each point is the squared distance from the point to its assigned cluster center.
# The WSS score is the sum of these squared errors over all the points.
wss <- numeric(10) # Initialize within-cluster sum of squares vector
for (i in 1:10) {
kmeans_model <- kmeans(Boston_scaled, centers = i, nstart = 10)
wss[i] <- kmeans_model$tot.withinss # Store within-cluster sum of squares
}
# Plotting the Elbow Method
plot(1:10, wss, type = "b", xlab = "Number of Clusters", ylab = "Within-cluster Sum of Squares", main = "Elbow Method")
# or we can use the direct function for the Elbow method
kmeans_model_optimal2 <- fviz_nbclust(Boston_scaled, kmeans, method = "wss")
# or we can use the silhouette method
fviz_nbclust(Boston_scaled, kmeans, method = "silhouette")
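As a quick illustration that k-means recovers clusters when they are well separated (a toy sketch on simulated data, not the Boston observations):

```r
set.seed(42)  # arbitrary seed so the simulation is reproducible
# two well-separated 2D groups of 25 points each (means 0 and 8)
x <- rbind(matrix(rnorm(50, mean = 0), ncol = 2),
           matrix(rnorm(50, mean = 8), ncol = 2))
km <- kmeans(x, centers = 2, nstart = 10)
table(km$cluster)  # the two groups of 25 are recovered exactly
```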
# 2 seems to be the best option according to the elbow method and the silhouette method
k <- 2
# k-means model:
kmeans_model <- kmeans(Boston_scaled, centers = k, nstart = 25)
cluster_assignments <- kmeans_model$cluster
# Visualize clusters using pairs(), colored by the final cluster assignment (this must come after the final fit, otherwise it would use the last model from the elbow loop, which had 10 centers)
pairs(Boston_scaled, col = kmeans_model$cluster)
# visualize the clusters with the fviz_cluster() function
fviz_cluster(kmeans_model, data = Boston_scaled)
library(GGally)
# Combine the scaled data with the cluster assignments (cluster 1 or 2)
clustered_data <- cbind(as.data.frame(Boston_scaled), Cluster = as.factor(cluster_assignments))
# Visualize clusters using ggpairs()
ggpairs(clustered_data, aes(color = Cluster))
## Warning in cor(x, y): the standard deviation is zero
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Mean for each cluster and variable:
clustered_data %>%
mutate(Cluster = kmeans_model$cluster) %>%
group_by(Cluster) %>%
summarise_all("mean")
## # A tibble: 2 × 15
## Cluster crim zn indus chas nox rm age dis rad
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 -0.389 0.262 -0.615 0.00291 -0.582 0.245 -0.433 0.454 -0.583
## 2 2 0.724 -0.487 1.14 -0.00541 1.08 -0.455 0.805 -0.844 1.08
## # ℹ 5 more variables: tax <dbl>, ptratio <dbl>, black <dbl>, lstat <dbl>,
## # medv <dbl>
model_predictors <- train_data %>% dplyr::select(-quantiles_crime)
# check the dimensions
dim(model_predictors)
## [1] 405 13
dim(lda_model$scaling)
## [1] 13 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda_model$scaling
matrix_product <- as.data.frame(matrix_product)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ordinal':
##
## slice
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')
Move the country names to rownames (see Exercise 5.5). Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them. (0-3 points)
library(tibble)
library(readr)
library(corrplot)
## corrplot 0.92 loaded
library(ggplot2)
library(tidyr)
library(GGally)
human_new <- read_csv("/Users/margot/Desktop/Desktop - MacBook Pro de MARGOT/Open data with R 2023/IODS-project/human_new.csv", show_col_types = FALSE)
str(human_new)
## spc_tbl_ [155 × 19] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ HDI.Rank : num [1:155] 1 2 3 4 5 6 6 8 9 9 ...
## $ Country : chr [1:155] "Norway" "Australia" "Switzerland" "Denmark" ...
## $ HDI : num [1:155] 0.944 0.935 0.93 0.923 0.922 0.916 0.916 0.915 0.913 0.913 ...
## $ Life.Exp : num [1:155] 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ Edu.Exp : num [1:155] 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ Edu.Mean : num [1:155] 12.6 13 12.8 12.7 11.9 13.1 12.2 12.9 13 12.5 ...
## $ GNI : num [1:155] 64992 42261 56431 44025 45435 ...
## $ GNI.Minus.Rank: num [1:155] 5 17 6 11 9 11 16 3 11 23 ...
## $ GII.Rank : num [1:155] 1 2 3 4 5 6 6 8 9 9 ...
## $ GII : num [1:155] 0.067 0.11 0.028 0.048 0.062 0.041 0.113 0.28 0.129 0.157 ...
## $ Mat.Mor : num [1:155] 4 6 6 5 6 7 9 28 11 8 ...
## $ Ado.Birth : num [1:155] 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## $ Parli.F : num [1:155] 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
## $ Edu2.F : num [1:155] 97.4 94.3 95 95.5 87.7 96.3 80.5 95.1 100 95 ...
## $ Edu2.M : num [1:155] 96.7 94.6 96.6 96.6 90.5 97 78.6 94.8 100 95.3 ...
## $ Labo.F : num [1:155] 61.2 58.8 61.8 58.7 58.5 53.6 53.1 56.3 61.6 62 ...
## $ Labo.M : num [1:155] 68.7 71.8 74.9 66.4 70.6 66.4 68.1 68.9 71 73.8 ...
## $ Edu2.FM : num [1:155] 1.007 0.997 0.983 0.989 0.969 ...
## $ Labo.FM : num [1:155] 0.891 0.819 0.825 0.884 0.829 ...
## - attr(*, "spec")=
## .. cols(
## .. HDI.Rank = col_double(),
## .. Country = col_character(),
## .. HDI = col_double(),
## .. Life.Exp = col_double(),
## .. Edu.Exp = col_double(),
## .. Edu.Mean = col_double(),
## .. GNI = col_double(),
## .. GNI.Minus.Rank = col_double(),
## .. GII.Rank = col_double(),
## .. GII = col_double(),
## .. Mat.Mor = col_double(),
## .. Ado.Birth = col_double(),
## .. Parli.F = col_double(),
## .. Edu2.F = col_double(),
## .. Edu2.M = col_double(),
## .. Labo.F = col_double(),
## .. Labo.M = col_double(),
## .. Edu2.FM = col_double(),
## .. Labo.FM = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
# move Country to the rownames
human_new <- column_to_rownames(human_new, "Country")
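A minimal sketch of what column_to_rownames() does, on a tiny made-up table (toy data, not the human dataset):

```r
library(tibble)  # provides column_to_rownames()

df <- data.frame(Country = c("A", "B"), v = c(1, 2))
df2 <- column_to_rownames(df, "Country")
rownames(df2)        # "A" "B" - the values became row names
"Country" %in% colnames(df2)  # FALSE - the column itself is removed
```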
is.na(human_new)
## (155 x 18 logical matrix; every entry shown is FALSE, i.e. no missing values - long output truncated)
## Czech Republic FALSE FALSE FALSE FALSE
## Greece FALSE FALSE FALSE FALSE
## Estonia FALSE FALSE FALSE FALSE
## Cyprus FALSE FALSE FALSE FALSE
## Qatar FALSE FALSE FALSE FALSE
## Slovakia FALSE FALSE FALSE FALSE
## Poland FALSE FALSE FALSE FALSE
## Lithuania FALSE FALSE FALSE FALSE
## Malta FALSE FALSE FALSE FALSE
## Saudi Arabia FALSE FALSE FALSE FALSE
## Argentina FALSE FALSE FALSE FALSE
## United Arab Emirates FALSE FALSE FALSE FALSE
## Chile FALSE FALSE FALSE FALSE
## Portugal FALSE FALSE FALSE FALSE
## Hungary FALSE FALSE FALSE FALSE
## Bahrain FALSE FALSE FALSE FALSE
## Latvia FALSE FALSE FALSE FALSE
## Croatia FALSE FALSE FALSE FALSE
## Kuwait FALSE FALSE FALSE FALSE
## Montenegro FALSE FALSE FALSE FALSE
## Belarus FALSE FALSE FALSE FALSE
## Russian Federation FALSE FALSE FALSE FALSE
## Oman FALSE FALSE FALSE FALSE
## Romania FALSE FALSE FALSE FALSE
## Uruguay FALSE FALSE FALSE FALSE
## Bahamas FALSE FALSE FALSE FALSE
## Kazakhstan FALSE FALSE FALSE FALSE
## Barbados FALSE FALSE FALSE FALSE
## Bulgaria FALSE FALSE FALSE FALSE
## Panama FALSE FALSE FALSE FALSE
## Malaysia FALSE FALSE FALSE FALSE
## Mauritius FALSE FALSE FALSE FALSE
## Trinidad and Tobago FALSE FALSE FALSE FALSE
## Serbia FALSE FALSE FALSE FALSE
## Cuba FALSE FALSE FALSE FALSE
## Lebanon FALSE FALSE FALSE FALSE
## Costa Rica FALSE FALSE FALSE FALSE
## Iran (Islamic Republic of) FALSE FALSE FALSE FALSE
## Venezuela (Bolivarian Republic of) FALSE FALSE FALSE FALSE
## Turkey FALSE FALSE FALSE FALSE
## Sri Lanka FALSE FALSE FALSE FALSE
## Mexico FALSE FALSE FALSE FALSE
## Brazil FALSE FALSE FALSE FALSE
## Georgia FALSE FALSE FALSE FALSE
## Azerbaijan FALSE FALSE FALSE FALSE
## Jordan FALSE FALSE FALSE FALSE
## The former Yugoslav Republic of Macedonia FALSE FALSE FALSE FALSE
## Ukraine FALSE FALSE FALSE FALSE
## Algeria FALSE FALSE FALSE FALSE
## Peru FALSE FALSE FALSE FALSE
## Albania FALSE FALSE FALSE FALSE
## Armenia FALSE FALSE FALSE FALSE
## Bosnia and Herzegovina FALSE FALSE FALSE FALSE
## Ecuador FALSE FALSE FALSE FALSE
## China FALSE FALSE FALSE FALSE
## Fiji FALSE FALSE FALSE FALSE
## Mongolia FALSE FALSE FALSE FALSE
## Thailand FALSE FALSE FALSE FALSE
## Libya FALSE FALSE FALSE FALSE
## Tunisia FALSE FALSE FALSE FALSE
## Colombia FALSE FALSE FALSE FALSE
## Jamaica FALSE FALSE FALSE FALSE
## Tonga FALSE FALSE FALSE FALSE
## Belize FALSE FALSE FALSE FALSE
## Dominican Republic FALSE FALSE FALSE FALSE
## Suriname FALSE FALSE FALSE FALSE
## Maldives FALSE FALSE FALSE FALSE
## Samoa FALSE FALSE FALSE FALSE
## Botswana FALSE FALSE FALSE FALSE
## Moldova (Republic of) FALSE FALSE FALSE FALSE
## Egypt FALSE FALSE FALSE FALSE
## Gabon FALSE FALSE FALSE FALSE
## Indonesia FALSE FALSE FALSE FALSE
## Paraguay FALSE FALSE FALSE FALSE
## Philippines FALSE FALSE FALSE FALSE
## El Salvador FALSE FALSE FALSE FALSE
## South Africa FALSE FALSE FALSE FALSE
## Viet Nam FALSE FALSE FALSE FALSE
## Bolivia (Plurinational State of) FALSE FALSE FALSE FALSE
## Kyrgyzstan FALSE FALSE FALSE FALSE
## Iraq FALSE FALSE FALSE FALSE
## Guyana FALSE FALSE FALSE FALSE
## Nicaragua FALSE FALSE FALSE FALSE
## Morocco FALSE FALSE FALSE FALSE
## Namibia FALSE FALSE FALSE FALSE
## Guatemala FALSE FALSE FALSE FALSE
## Tajikistan FALSE FALSE FALSE FALSE
## India FALSE FALSE FALSE FALSE
## Honduras FALSE FALSE FALSE FALSE
## Bhutan FALSE FALSE FALSE FALSE
## Syrian Arab Republic FALSE FALSE FALSE FALSE
## Congo FALSE FALSE FALSE FALSE
## Zambia FALSE FALSE FALSE FALSE
## Ghana FALSE FALSE FALSE FALSE
## Bangladesh FALSE FALSE FALSE FALSE
## Cambodia FALSE FALSE FALSE FALSE
## Kenya FALSE FALSE FALSE FALSE
## Nepal FALSE FALSE FALSE FALSE
## Pakistan FALSE FALSE FALSE FALSE
## Myanmar FALSE FALSE FALSE FALSE
## Swaziland FALSE FALSE FALSE FALSE
## Tanzania (United Republic of) FALSE FALSE FALSE FALSE
## Cameroon FALSE FALSE FALSE FALSE
## Zimbabwe FALSE FALSE FALSE FALSE
## Mauritania FALSE FALSE FALSE FALSE
## Papua New Guinea FALSE FALSE FALSE FALSE
## Yemen FALSE FALSE FALSE FALSE
## Lesotho FALSE FALSE FALSE FALSE
## Togo FALSE FALSE FALSE FALSE
## Haiti FALSE FALSE FALSE FALSE
## Rwanda FALSE FALSE FALSE FALSE
## Uganda FALSE FALSE FALSE FALSE
## Benin FALSE FALSE FALSE FALSE
## Sudan FALSE FALSE FALSE FALSE
## Senegal FALSE FALSE FALSE FALSE
## Afghanistan FALSE FALSE FALSE FALSE
## Côte d'Ivoire FALSE FALSE FALSE FALSE
## Malawi FALSE FALSE FALSE FALSE
## Ethiopia FALSE FALSE FALSE FALSE
## Gambia FALSE FALSE FALSE FALSE
## Congo (Democratic Republic of the) FALSE FALSE FALSE FALSE
## Liberia FALSE FALSE FALSE FALSE
## Mali FALSE FALSE FALSE FALSE
## Mozambique FALSE FALSE FALSE FALSE
## Sierra Leone FALSE FALSE FALSE FALSE
## Burkina Faso FALSE FALSE FALSE FALSE
## Burundi FALSE FALSE FALSE FALSE
## Chad FALSE FALSE FALSE FALSE
## Central African Republic FALSE FALSE FALSE FALSE
## Niger FALSE FALSE FALSE FALSE
## Parli.F Edu2.F Edu2.M Labo.F Labo.M
## Norway FALSE FALSE FALSE FALSE FALSE
## Australia FALSE FALSE FALSE FALSE FALSE
## Switzerland FALSE FALSE FALSE FALSE FALSE
## Denmark FALSE FALSE FALSE FALSE FALSE
## Netherlands FALSE FALSE FALSE FALSE FALSE
## Germany FALSE FALSE FALSE FALSE FALSE
## Ireland FALSE FALSE FALSE FALSE FALSE
## United States FALSE FALSE FALSE FALSE FALSE
## Canada FALSE FALSE FALSE FALSE FALSE
## New Zealand FALSE FALSE FALSE FALSE FALSE
## Singapore FALSE FALSE FALSE FALSE FALSE
## Sweden FALSE FALSE FALSE FALSE FALSE
## United Kingdom FALSE FALSE FALSE FALSE FALSE
## Iceland FALSE FALSE FALSE FALSE FALSE
## Korea (Republic of) FALSE FALSE FALSE FALSE FALSE
## Israel FALSE FALSE FALSE FALSE FALSE
## Luxembourg FALSE FALSE FALSE FALSE FALSE
## Japan FALSE FALSE FALSE FALSE FALSE
## Belgium FALSE FALSE FALSE FALSE FALSE
## France FALSE FALSE FALSE FALSE FALSE
## Austria FALSE FALSE FALSE FALSE FALSE
## Finland FALSE FALSE FALSE FALSE FALSE
## Slovenia FALSE FALSE FALSE FALSE FALSE
## Spain FALSE FALSE FALSE FALSE FALSE
## Italy FALSE FALSE FALSE FALSE FALSE
## Czech Republic FALSE FALSE FALSE FALSE FALSE
## Greece FALSE FALSE FALSE FALSE FALSE
## Estonia FALSE FALSE FALSE FALSE FALSE
## Cyprus FALSE FALSE FALSE FALSE FALSE
## Qatar FALSE FALSE FALSE FALSE FALSE
## Slovakia FALSE FALSE FALSE FALSE FALSE
## Poland FALSE FALSE FALSE FALSE FALSE
## Lithuania FALSE FALSE FALSE FALSE FALSE
## Malta FALSE FALSE FALSE FALSE FALSE
## Saudi Arabia FALSE FALSE FALSE FALSE FALSE
## Argentina FALSE FALSE FALSE FALSE FALSE
## United Arab Emirates FALSE FALSE FALSE FALSE FALSE
## Chile FALSE FALSE FALSE FALSE FALSE
## Portugal FALSE FALSE FALSE FALSE FALSE
## Hungary FALSE FALSE FALSE FALSE FALSE
## Bahrain FALSE FALSE FALSE FALSE FALSE
## Latvia FALSE FALSE FALSE FALSE FALSE
## Croatia FALSE FALSE FALSE FALSE FALSE
## Kuwait FALSE FALSE FALSE FALSE FALSE
## Montenegro FALSE FALSE FALSE FALSE FALSE
## Belarus FALSE FALSE FALSE FALSE FALSE
## Russian Federation FALSE FALSE FALSE FALSE FALSE
## Oman FALSE FALSE FALSE FALSE FALSE
## Romania FALSE FALSE FALSE FALSE FALSE
## Uruguay FALSE FALSE FALSE FALSE FALSE
## Bahamas FALSE FALSE FALSE FALSE FALSE
## Kazakhstan FALSE FALSE FALSE FALSE FALSE
## Barbados FALSE FALSE FALSE FALSE FALSE
## Bulgaria FALSE FALSE FALSE FALSE FALSE
## Panama FALSE FALSE FALSE FALSE FALSE
## Malaysia FALSE FALSE FALSE FALSE FALSE
## Mauritius FALSE FALSE FALSE FALSE FALSE
## Trinidad and Tobago FALSE FALSE FALSE FALSE FALSE
## Serbia FALSE FALSE FALSE FALSE FALSE
## Cuba FALSE FALSE FALSE FALSE FALSE
## Lebanon FALSE FALSE FALSE FALSE FALSE
## Costa Rica FALSE FALSE FALSE FALSE FALSE
## Iran (Islamic Republic of) FALSE FALSE FALSE FALSE FALSE
## Venezuela (Bolivarian Republic of) FALSE FALSE FALSE FALSE FALSE
## Turkey FALSE FALSE FALSE FALSE FALSE
## Sri Lanka FALSE FALSE FALSE FALSE FALSE
## Mexico FALSE FALSE FALSE FALSE FALSE
## Brazil FALSE FALSE FALSE FALSE FALSE
## Georgia FALSE FALSE FALSE FALSE FALSE
## Azerbaijan FALSE FALSE FALSE FALSE FALSE
## Jordan FALSE FALSE FALSE FALSE FALSE
## The former Yugoslav Republic of Macedonia FALSE FALSE FALSE FALSE FALSE
## Ukraine FALSE FALSE FALSE FALSE FALSE
## Algeria FALSE FALSE FALSE FALSE FALSE
## Peru FALSE FALSE FALSE FALSE FALSE
## Albania FALSE FALSE FALSE FALSE FALSE
## Armenia FALSE FALSE FALSE FALSE FALSE
## Bosnia and Herzegovina FALSE FALSE FALSE FALSE FALSE
## Ecuador FALSE FALSE FALSE FALSE FALSE
## China FALSE FALSE FALSE FALSE FALSE
## Fiji FALSE FALSE FALSE FALSE FALSE
## Mongolia FALSE FALSE FALSE FALSE FALSE
## Thailand FALSE FALSE FALSE FALSE FALSE
## Libya FALSE FALSE FALSE FALSE FALSE
## Tunisia FALSE FALSE FALSE FALSE FALSE
## Colombia FALSE FALSE FALSE FALSE FALSE
## Jamaica FALSE FALSE FALSE FALSE FALSE
## Tonga FALSE FALSE FALSE FALSE FALSE
## Belize FALSE FALSE FALSE FALSE FALSE
## Dominican Republic FALSE FALSE FALSE FALSE FALSE
## Suriname FALSE FALSE FALSE FALSE FALSE
## Maldives FALSE FALSE FALSE FALSE FALSE
## Samoa FALSE FALSE FALSE FALSE FALSE
## Botswana FALSE FALSE FALSE FALSE FALSE
## Moldova (Republic of) FALSE FALSE FALSE FALSE FALSE
## Egypt FALSE FALSE FALSE FALSE FALSE
## Gabon FALSE FALSE FALSE FALSE FALSE
## Indonesia FALSE FALSE FALSE FALSE FALSE
## Paraguay FALSE FALSE FALSE FALSE FALSE
## Philippines FALSE FALSE FALSE FALSE FALSE
## El Salvador FALSE FALSE FALSE FALSE FALSE
## South Africa FALSE FALSE FALSE FALSE FALSE
## Viet Nam FALSE FALSE FALSE FALSE FALSE
## Bolivia (Plurinational State of) FALSE FALSE FALSE FALSE FALSE
## Kyrgyzstan FALSE FALSE FALSE FALSE FALSE
## Iraq FALSE FALSE FALSE FALSE FALSE
## Guyana FALSE FALSE FALSE FALSE FALSE
## Nicaragua FALSE FALSE FALSE FALSE FALSE
## Morocco FALSE FALSE FALSE FALSE FALSE
## Namibia FALSE FALSE FALSE FALSE FALSE
## Guatemala FALSE FALSE FALSE FALSE FALSE
## Tajikistan FALSE FALSE FALSE FALSE FALSE
## India FALSE FALSE FALSE FALSE FALSE
## Honduras FALSE FALSE FALSE FALSE FALSE
## Bhutan FALSE FALSE FALSE FALSE FALSE
## Syrian Arab Republic FALSE FALSE FALSE FALSE FALSE
## Congo FALSE FALSE FALSE FALSE FALSE
## Zambia FALSE FALSE FALSE FALSE FALSE
## Ghana FALSE FALSE FALSE FALSE FALSE
## Bangladesh FALSE FALSE FALSE FALSE FALSE
## Cambodia FALSE FALSE FALSE FALSE FALSE
## Kenya FALSE FALSE FALSE FALSE FALSE
## Nepal FALSE FALSE FALSE FALSE FALSE
## Pakistan FALSE FALSE FALSE FALSE FALSE
## Myanmar FALSE FALSE FALSE FALSE FALSE
## Swaziland FALSE FALSE FALSE FALSE FALSE
## Tanzania (United Republic of) FALSE FALSE FALSE FALSE FALSE
## Cameroon FALSE FALSE FALSE FALSE FALSE
## Zimbabwe FALSE FALSE FALSE FALSE FALSE
## Mauritania FALSE FALSE FALSE FALSE FALSE
## Papua New Guinea FALSE FALSE FALSE FALSE FALSE
## Yemen FALSE FALSE FALSE FALSE FALSE
## Lesotho FALSE FALSE FALSE FALSE FALSE
## Togo FALSE FALSE FALSE FALSE FALSE
## Haiti FALSE FALSE FALSE FALSE FALSE
## Rwanda FALSE FALSE FALSE FALSE FALSE
## Uganda FALSE FALSE FALSE FALSE FALSE
## Benin FALSE FALSE FALSE FALSE FALSE
## Sudan FALSE FALSE FALSE FALSE FALSE
## Senegal FALSE FALSE FALSE FALSE FALSE
## Afghanistan FALSE FALSE FALSE FALSE FALSE
## Côte d'Ivoire FALSE FALSE FALSE FALSE FALSE
## Malawi FALSE FALSE FALSE FALSE FALSE
## Ethiopia FALSE FALSE FALSE FALSE FALSE
## Gambia FALSE FALSE FALSE FALSE FALSE
## Congo (Democratic Republic of the) FALSE FALSE FALSE FALSE FALSE
## Liberia FALSE FALSE FALSE FALSE FALSE
## Mali FALSE FALSE FALSE FALSE FALSE
## Mozambique FALSE FALSE FALSE FALSE FALSE
## Sierra Leone FALSE FALSE FALSE FALSE FALSE
## Burkina Faso FALSE FALSE FALSE FALSE FALSE
## Burundi FALSE FALSE FALSE FALSE FALSE
## Chad FALSE FALSE FALSE FALSE FALSE
## Central African Republic FALSE FALSE FALSE FALSE FALSE
## Niger FALSE FALSE FALSE FALSE FALSE
## Edu2.FM Labo.FM
## Norway FALSE FALSE
## Australia FALSE FALSE
## Switzerland FALSE FALSE
## Denmark FALSE FALSE
## Netherlands FALSE FALSE
## Germany FALSE FALSE
## Ireland FALSE FALSE
## United States FALSE FALSE
## Canada FALSE FALSE
## New Zealand FALSE FALSE
## Singapore FALSE FALSE
## Sweden FALSE FALSE
## United Kingdom FALSE FALSE
## Iceland FALSE FALSE
## Korea (Republic of) FALSE FALSE
## Israel FALSE FALSE
## Luxembourg FALSE FALSE
## Japan FALSE FALSE
## Belgium FALSE FALSE
## France FALSE FALSE
## Austria FALSE FALSE
## Finland FALSE FALSE
## Slovenia FALSE FALSE
## Spain FALSE FALSE
## Italy FALSE FALSE
## Czech Republic FALSE FALSE
## Greece FALSE FALSE
## Estonia FALSE FALSE
## Cyprus FALSE FALSE
## Qatar FALSE FALSE
## Slovakia FALSE FALSE
## Poland FALSE FALSE
## Lithuania FALSE FALSE
## Malta FALSE FALSE
## Saudi Arabia FALSE FALSE
## Argentina FALSE FALSE
## United Arab Emirates FALSE FALSE
## Chile FALSE FALSE
## Portugal FALSE FALSE
## Hungary FALSE FALSE
## Bahrain FALSE FALSE
## Latvia FALSE FALSE
## Croatia FALSE FALSE
## Kuwait FALSE FALSE
## Montenegro FALSE FALSE
## Belarus FALSE FALSE
## Russian Federation FALSE FALSE
## Oman FALSE FALSE
## Romania FALSE FALSE
## Uruguay FALSE FALSE
## Bahamas FALSE FALSE
## Kazakhstan FALSE FALSE
## Barbados FALSE FALSE
## Bulgaria FALSE FALSE
## Panama FALSE FALSE
## Malaysia FALSE FALSE
## Mauritius FALSE FALSE
## Trinidad and Tobago FALSE FALSE
## Serbia FALSE FALSE
## Cuba FALSE FALSE
## Lebanon FALSE FALSE
## Costa Rica FALSE FALSE
## Iran (Islamic Republic of) FALSE FALSE
## Venezuela (Bolivarian Republic of) FALSE FALSE
## Turkey FALSE FALSE
## Sri Lanka FALSE FALSE
## Mexico FALSE FALSE
## Brazil FALSE FALSE
## Georgia FALSE FALSE
## Azerbaijan FALSE FALSE
## Jordan FALSE FALSE
## The former Yugoslav Republic of Macedonia FALSE FALSE
## Ukraine FALSE FALSE
## Algeria FALSE FALSE
## Peru FALSE FALSE
## Albania FALSE FALSE
## Armenia FALSE FALSE
## Bosnia and Herzegovina FALSE FALSE
## Ecuador FALSE FALSE
## China FALSE FALSE
## Fiji FALSE FALSE
## Mongolia FALSE FALSE
## Thailand FALSE FALSE
## Libya FALSE FALSE
## Tunisia FALSE FALSE
## Colombia FALSE FALSE
## Jamaica FALSE FALSE
## Tonga FALSE FALSE
## Belize FALSE FALSE
## Dominican Republic FALSE FALSE
## Suriname FALSE FALSE
## Maldives FALSE FALSE
## Samoa FALSE FALSE
## Botswana FALSE FALSE
## Moldova (Republic of) FALSE FALSE
## Egypt FALSE FALSE
## Gabon FALSE FALSE
## Indonesia FALSE FALSE
## Paraguay FALSE FALSE
## Philippines FALSE FALSE
## El Salvador FALSE FALSE
## South Africa FALSE FALSE
## Viet Nam FALSE FALSE
## Bolivia (Plurinational State of) FALSE FALSE
## Kyrgyzstan FALSE FALSE
## Iraq FALSE FALSE
## Guyana FALSE FALSE
## Nicaragua FALSE FALSE
## Morocco FALSE FALSE
## Namibia FALSE FALSE
## Guatemala FALSE FALSE
## Tajikistan FALSE FALSE
## India FALSE FALSE
## Honduras FALSE FALSE
## Bhutan FALSE FALSE
## Syrian Arab Republic FALSE FALSE
## Congo FALSE FALSE
## Zambia FALSE FALSE
## Ghana FALSE FALSE
## Bangladesh FALSE FALSE
## Cambodia FALSE FALSE
## Kenya FALSE FALSE
## Nepal FALSE FALSE
## Pakistan FALSE FALSE
## Myanmar FALSE FALSE
## Swaziland FALSE FALSE
## Tanzania (United Republic of) FALSE FALSE
## Cameroon FALSE FALSE
## Zimbabwe FALSE FALSE
## Mauritania FALSE FALSE
## Papua New Guinea FALSE FALSE
## Yemen FALSE FALSE
## Lesotho FALSE FALSE
## Togo FALSE FALSE
## Haiti FALSE FALSE
## Rwanda FALSE FALSE
## Uganda FALSE FALSE
## Benin FALSE FALSE
## Sudan FALSE FALSE
## Senegal FALSE FALSE
## Afghanistan FALSE FALSE
## Côte d'Ivoire FALSE FALSE
## Malawi FALSE FALSE
## Ethiopia FALSE FALSE
## Gambia FALSE FALSE
## Congo (Democratic Republic of the) FALSE FALSE
## Liberia FALSE FALSE
## Mali FALSE FALSE
## Mozambique FALSE FALSE
## Sierra Leone FALSE FALSE
## Burkina Faso FALSE FALSE
## Burundi FALSE FALSE
## Chad FALSE FALSE
## Central African Republic FALSE FALSE
## Niger FALSE FALSE
# histogram per variable
pivot_longer(human_new, cols = everything()) %>%
ggplot(aes(value)) + facet_wrap("name", scales = "free") +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# pairwise charts and correlations - with this many variables the plot is not very readable
ggpairs(human_new,lower=list(combo=wrap("facethist",binwidth=0.1)))
# correlation between variables
cor_matrix <- cor(human_new)
corrplot(cor_matrix, method = "circle")
HDI, Life.Exp, Edu.Exp and Edu.Mean are negatively correlated with GII,
Mat.Mor and Ado.Birth. HDI, Life.Exp, Edu.Exp and Edu.Mean are positively
correlated with Edu2 for both males and females. GII, Ado.Birth and Mat.Mor
are positively correlated with one another, while GII and Mat.Mor are
negatively correlated with education for females and males.
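To back these interpretations with numbers, one option (a sketch, assuming `human_new` from the chunks above; `cor_long` is a name I introduce here) is to flatten the correlation matrix and sort the pairs by absolute value:

```r
library(dplyr)
library(tidyr)

# flatten the correlation matrix into one row per variable pair
cor_long <- as.data.frame(cor(human_new)) %>%
  tibble::rownames_to_column("var1") %>%
  pivot_longer(-var1, names_to = "var2", values_to = "r") %>%
  filter(var1 < var2) %>%          # keep each pair only once
  arrange(desc(abs(r)))

head(cor_long)                     # the strongest correlations first
```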
Perform principal component analysis (PCA) on the raw (non-standardized) human data. Show the variability captured by the principal components. Draw a biplot displaying the observations by the first two principal components (PC1 coordinate in x-axis, PC2 coordinate in y-axis), along with arrows representing the original variables. (0-2 points)
# pca
pca_human <- prcomp(human_new)
# draw a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2)
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## (the same warning is repeated 11 times, once per variable arrow)
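The assignment also asks to show the variability captured by the principal components; `summary()` on the `prcomp` object reports this directly (a minimal sketch, assuming `pca_human` from the chunk above):

```r
# variability captured by the principal components
s <- summary(pca_human)
s$importance                        # sd, proportion and cumulative variance
round(100 * s$importance[2, ], 1)   # percentage of variance per component
```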
Standardize the variables in the human data and repeat the above analysis. Interpret the results of both analysis (with and without standardizing). Are the results different? Why or why not? Include captions (brief descriptions) in your plots where you describe the results by using not just your variable names, but the actual phenomena they relate to. (0-4 points)
# standardize human_new
scaled_human <- scale(human_new)
# new PCA
pca_human2 <- prcomp(scaled_human)
# draw a biplot of the principal component representation and the original variables with the scaled data
biplot(pca_human2, choices = 1:2)
# Question: How to make it readable? what is expected in this question?
# Interpret the results of both analysis (with and without standardizing). Are the results different? Why or why not? Include captions (brief descriptions) in your plots where you describe the results by using not just your variable names, but the actual phenomena they relate to.
# Yes, the results are different: without standardization each variable has a different weight (variance) in the model, so variables on large scales dominate the components. That is why it is necessary to scale / standardize the data first.
# Let's reduce the dataset to only the last 30 countries and a few of the variables.
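One possible answer to the readability question, sketched under the assumption that `pca_human2` exists as above: shrink the observation labels, colour the arrows, and put the captured variance on the axis labels:

```r
# percentage of variance captured by PC1 and PC2
pct <- round(100 * summary(pca_human2)$importance[2, 1:2], 1)

# smaller country labels (first cex/col) vs. highlighted arrows (second)
biplot(pca_human2, choices = 1:2,
       cex = c(0.5, 0.9),
       col = c("grey50", "deeppink2"),
       xlab = paste0("PC1 (", pct[1], "%)"),
       ylab = paste0("PC2 (", pct[2], "%)"))
```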
scaled_human2 <- scaled_human %>% tail(n=30)
scaled_human2 <- as.data.frame(scaled_human2)
scaled_human2 %>% dplyr::select(HDI,GNI,GII,Edu.Exp,Life.Exp,Mat.Mor,Ado.Birth,Edu2.F,Edu2.M)
## HDI GNI GII Edu.Exp
## Swaziland -1.116487 -0.6517472 0.9982202 -0.6605505
## Tanzania (United Republic of) -1.181055 -0.8205902 0.9459891 -1.3999219
## Cameroon -1.239166 -0.7994511 1.1549134 -0.9774240
## Zimbabwe -1.258537 -0.8635155 0.7213955 -0.8013832
## Mauritania -1.277907 -0.7586289 1.2750449 -1.6463790
## Papua New Guinea -1.284364 -0.8177860 1.2802680 -1.1534648
## Yemen -1.329562 -0.7608399 1.9749414 -1.3999219
## Lesotho -1.336019 -0.7723262 0.9146505 -0.7309668
## Togo -1.419957 -0.8843849 1.1601365 -0.3436771
## Haiti -1.426414 -0.8606034 1.2384832 -1.5759627
## Rwanda -1.426414 -0.8719819 0.1781922 -1.0126321
## Uganda -1.426414 -0.8636233 0.8989811 -1.1886729
## Benin -1.445784 -0.8553187 1.2959373 -0.7309668
## Sudan -1.452241 -0.7452013 1.1758059 -2.1745014
## Senegal -1.536180 -0.8326157 0.8467501 -1.8576279
## Afghanistan -1.542637 -0.8489554 1.7085629 -1.3647137
## Côte d'Ivoire -1.562007 -0.7796062 1.6354394 -1.5055464
## Malawi -1.671773 -0.9103234 1.2802680 -0.8365913
## Ethiopia -1.691143 -0.8735997 1.0034433 -1.6463790
## Gambia -1.697600 -0.8693395 1.3377222 -1.5407545
## Congo (Democratic Republic of the) -1.749255 -0.9139365 1.6041007 -1.1886729
## Liberia -1.768625 -0.9071957 1.4891923 -1.2942974
## Mali -1.839650 -0.8652411 1.6249931 -1.6815871
## Mozambique -1.859020 -0.8900472 1.1758059 -1.3647137
## Sierra Leone -1.878391 -0.8546176 1.4839692 -1.6111708
## Burkina Faso -1.949416 -0.8648097 1.3847302 -1.8928361
## Burundi -1.962329 -0.9097302 0.6587182 -1.0830484
## Chad -2.013984 -0.8381701 1.7764633 -2.0336687
## Central African Republic -2.285170 -0.9192752 1.5100848 -2.1040851
## Niger -2.298084 -0.9016413 1.8130250 -2.7378319
## Life.Exp Mat.Mor Ado.Birth Edu2.F
## Swaziland -2.7188401 0.7597924 0.604233786 -1.0822449
## Tanzania (United Republic of) -0.7985475 1.2319592 1.837448830 -1.6092561
## Cameroon -1.9387212 2.0818593 1.669614830 -1.1016441
## Zimbabwe -1.6986846 1.5152592 0.319645699 -0.2157481
## Mauritania -1.0265822 0.8070091 0.635854684 -1.5219597
## Papua New Guinea -1.0865914 0.3348424 0.363428481 -1.5445921
## Yemen -0.9425694 0.5709257 -0.003860417 -1.5122601
## Lesotho -2.6228255 1.6096926 1.027467351 -1.0822449
## Togo -1.4346444 1.4208259 1.078547264 -1.2697704
## Haiti -1.0625877 1.0903092 -0.125479258 -1.0660789
## Rwanda -0.8945621 0.8070091 -0.329798910 -1.5316593
## Uganda -1.5786664 0.9958758 1.932311526 -1.0499130
## Benin -1.4466462 0.9014425 1.046926366 -1.4249638
## Sudan -0.9785749 0.9958758 0.896119003 -1.3990982
## Senegal -0.6185201 0.8070091 1.149086192 -1.5575249
## Afghanistan -1.3506316 1.1847425 0.964225554 -1.5995565
## Côte d'Ivoire -2.4187944 2.6956761 2.022309468 -1.3376675
## Malawi -1.0625877 1.7041259 2.375004106 -1.4314302
## Ethiopia -0.9065639 1.2791758 0.759905902 -1.5381257
## Gambia -1.3746353 1.3263925 1.669614830 -1.2277388
## Congo (Democratic Republic of the) -1.5546627 2.7428927 2.143928308 -1.3764659
## Liberia -1.2906225 2.3179427 1.708532859 -1.2924027
## Mali -1.6386755 1.8929926 3.124176164 -1.5413589
## Mozambique -1.9867285 1.5624759 2.204737729 -1.7450503
## Sierra Leone -2.4908053 4.4899097 1.302325931 -1.4669954
## Burkina Faso -1.5546627 1.1847425 1.659885323 -1.7612163
## Burundi -1.7946993 2.7901094 -0.410067345 -1.6189556
## Chad -2.4067925 3.9233096 2.550135236 -1.7353507
## Central African Republic -2.5148090 3.4511428 1.243948888 -1.4637622
## Niger -1.2306133 2.2707260 3.834430193 -1.7127184
## Edu2.M
## Swaziland -1.23222954
## Tanzania (United Republic of) -1.81851973
## Cameroon -0.91598816
## Zimbabwe 0.04694906
## Mauritania -1.41344651
## Papua New Guinea -1.64085604
## Yemen -1.20735662
## Lesotho -1.48095871
## Togo -0.72411137
## Haiti -0.90532834
## Rwanda -1.84339265
## Uganda -0.96573400
## Benin -1.19669680
## Sudan -1.50938490
## Senegal -1.60887657
## Afghanistan -1.09720513
## Côte d'Ivoire -1.08654531
## Malawi -1.38857359
## Ethiopia -1.50938490
## Gambia -1.03679948
## Congo (Democratic Republic of the) -1.00482001
## Liberia -0.75964411
## Mali -1.61953640
## Mozambique -1.93577777
## Sierra Leone -1.38502032
## Burkina Faso -2.04237599
## Burundi -1.86115902
## Chad -1.80430664
## Central African Republic -1.20735662
## Niger -1.87892539
pca_human3 <- prcomp(scaled_human2)
biplot(pca_human3, choices = 1:2)
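An alternative sketch (assuming the `factoextra` package is installed -- it is not used elsewhere in this report): its PCA biplot repels overlapping labels automatically, which helps with crowded country names:

```r
library(factoextra)

# biplot of the reduced, scaled data with non-overlapping labels
fviz_pca_biplot(pca_human3, repel = TRUE,
                col.var = "deeppink2", col.ind = "grey40")
```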
Give your personal interpretations of the first two principal component dimensions based on the biplot drawn after PCA on the standardized human data. (0-2 points)
# Question: How to analyze when we cannot read it well?
The tea data comes from the FactoMineR package. It was collected with a questionnaire on tea: 300 individuals were asked how they drink tea (18 questions), what their perception of the product is (12 questions), and some personal details (4 questions).
Load the tea dataset and convert its character variables to factors:
tea <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)
Explore the data briefly: look at the structure and the dimensions of the data. Use View(tea) to browse its contents, and visualize the data.
Use Multiple Correspondence Analysis (MCA) on the tea data (or on just certain columns of the data, it is up to you!). Interpret the results of the MCA and draw at least the variable biplot of the analysis. You can also explore other plotting options for MCA. Comment on the output of the plots. (0-4 points)
tea_new <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)
view(tea_new)
str(tea_new)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "+60","15-24",..: 4 5 5 2 5 2 4 4 4 4 ...
## $ frequency : Factor w/ 4 levels "+2/day","1 to 2/week",..: 3 3 1 3 1 3 4 2 1 1 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
# Most variables are already factors, except a few. Let's map the different values with MCA.
# But first, let's keep only the categorical variables. (We could also have converted the remaining variables to categorical ones.)
tea_new_2 <- tea_new %>% dplyr::select(-breakfast,-age)
library(ggplot2)
pivot_longer(tea_new_2, cols = everything()) %>%
ggplot(aes(value)) + facet_wrap("name", scales = "free") +
geom_bar() +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
# Too many variables - it is hard to see anything.
# Question: how to select the variables we want to use? Should they have less correlation?
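One possible answer to my own question: rather than filtering by correlation, hand-pick a small, interpretable subset of the questions (the six column names below are just an example choice; they all appear in the `str()` output above):

```r
# an example subset of interpretable tea questions
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")
tea_small <- dplyr::select(tea_new, dplyr::all_of(keep_columns))
str(tea_small)
```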
library(FactoMineR)
mca <- MCA(tea_new_2, graph = FALSE)
# summary of the model
summary(mca)
##
## Call:
## MCA(X = tea_new_2, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 0.092 0.084 0.072 0.060 0.057 0.055 0.050
## % of var. 5.913 5.392 4.629 3.881 3.672 3.502 3.188
## Cumulative % of var. 5.913 11.305 15.934 19.815 23.488 26.989 30.177
## Dim.8 Dim.9 Dim.10 Dim.11 Dim.12 Dim.13 Dim.14
## Variance 0.049 0.048 0.045 0.042 0.042 0.039 0.038
## % of var. 3.138 3.050 2.885 2.675 2.672 2.491 2.424
## Cumulative % of var. 33.315 36.365 39.251 41.926 44.598 47.089 49.513
## Dim.15 Dim.16 Dim.17 Dim.18 Dim.19 Dim.20 Dim.21
## Variance 0.036 0.036 0.034 0.032 0.032 0.031 0.030
## % of var. 2.320 2.305 2.173 2.077 2.049 2.016 1.944
## Cumulative % of var. 51.833 54.138 56.311 58.387 60.436 62.453 64.397
## Dim.22 Dim.23 Dim.24 Dim.25 Dim.26 Dim.27 Dim.28
## Variance 0.029 0.027 0.026 0.026 0.025 0.024 0.024
## % of var. 1.880 1.761 1.692 1.669 1.624 1.553 1.544
## Cumulative % of var. 66.276 68.038 69.730 71.399 73.023 74.577 76.120
## Dim.29 Dim.30 Dim.31 Dim.32 Dim.33 Dim.34 Dim.35
## Variance 0.023 0.022 0.022 0.021 0.020 0.019 0.019
## % of var. 1.456 1.426 1.404 1.339 1.264 1.246 1.221
## Cumulative % of var. 77.576 79.002 80.406 81.745 83.009 84.255 85.476
## Dim.36 Dim.37 Dim.38 Dim.39 Dim.40 Dim.41 Dim.42
## Variance 0.017 0.017 0.017 0.016 0.016 0.015 0.015
## % of var. 1.120 1.112 1.078 1.028 1.005 0.968 0.939
## Cumulative % of var. 86.596 87.708 88.786 89.814 90.819 91.786 92.725
## Dim.43 Dim.44 Dim.45 Dim.46 Dim.47 Dim.48 Dim.49
## Variance 0.013 0.013 0.012 0.012 0.011 0.010 0.010
## % of var. 0.866 0.837 0.802 0.739 0.697 0.674 0.666
## Cumulative % of var. 93.590 94.427 95.229 95.968 96.666 97.339 98.005
## Dim.50 Dim.51 Dim.52 Dim.53
## Variance 0.009 0.008 0.007 0.006
## % of var. 0.603 0.531 0.458 0.403
## Cumulative % of var. 98.608 99.139 99.597 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr
## 1 | -0.629 1.432 0.202 | 0.148 0.086 0.011 | 0.109 0.055
## 2 | -0.427 0.660 0.139 | 0.288 0.330 0.064 | -0.110 0.056
## 3 | 0.097 0.034 0.006 | -0.156 0.096 0.015 | 0.125 0.073
## 4 | -0.560 1.133 0.227 | -0.279 0.308 0.056 | -0.026 0.003
## 5 | -0.173 0.108 0.028 | -0.149 0.088 0.020 | 0.027 0.003
## 6 | -0.681 1.675 0.272 | -0.293 0.340 0.050 | -0.008 0.000
## 7 | -0.223 0.181 0.037 | 0.015 0.001 0.000 | 0.183 0.155
## 8 | -0.031 0.003 0.001 | 0.111 0.048 0.009 | -0.097 0.043
## 9 | -0.063 0.014 0.003 | 0.266 0.281 0.048 | 0.388 0.697
## 10 | 0.177 0.114 0.021 | 0.369 0.539 0.090 | 0.315 0.459
## cos2
## 1 0.006 |
## 2 0.009 |
## 3 0.009 |
## 4 0.000 |
## 5 0.001 |
## 6 0.000 |
## 7 0.025 |
## 8 0.007 |
## 9 0.103 |
## 10 0.066 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2 v.test
## Not.tea time | -0.552 4.238 0.236 -8.396 | 0.001 0.000 0.000 0.019 |
## tea time | 0.428 3.285 0.236 8.396 | -0.001 0.000 0.000 -0.019 |
## evening | 0.296 0.961 0.046 3.704 | -0.404 1.961 0.085 -5.051 |
## Not.evening | -0.155 0.503 0.046 -3.704 | 0.211 1.025 0.085 5.051 |
## lunch | 0.590 1.630 0.060 4.231 | -0.406 0.844 0.028 -2.907 |
## Not.lunch | -0.101 0.280 0.060 -4.231 | 0.070 0.145 0.028 2.907 |
## dinner | -1.047 2.447 0.082 -4.965 | -0.080 0.016 0.000 -0.379 |
## Not.dinner | 0.079 0.184 0.082 4.965 | 0.006 0.001 0.000 0.379 |
## always | 0.342 1.281 0.061 4.276 | -0.255 0.784 0.034 -3.193 |
## Not.always | -0.179 0.670 0.061 -4.276 | 0.134 0.410 0.034 3.193 |
## Dim.3 ctr cos2 v.test
## Not.tea time 0.063 0.071 0.003 0.959 |
## tea time -0.049 0.055 0.003 -0.959 |
## evening 0.324 1.469 0.055 4.052 |
## Not.evening -0.169 0.768 0.055 -4.052 |
## lunch 0.254 0.386 0.011 1.822 |
## Not.lunch -0.044 0.066 0.011 -1.822 |
## dinner 0.744 1.581 0.042 3.531 |
## Not.dinner -0.056 0.119 0.042 -3.531 |
## always 0.095 0.126 0.005 1.186 |
## Not.always -0.050 0.066 0.005 -1.186 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## tea.time | 0.236 0.000 0.003 |
## evening | 0.046 0.085 0.055 |
## lunch | 0.060 0.028 0.011 |
## dinner | 0.082 0.000 0.042 |
## always | 0.061 0.034 0.005 |
## home | 0.013 0.002 0.027 |
## work | 0.071 0.020 0.027 |
## tearoom | 0.326 0.021 0.027 |
## friends | 0.201 0.058 0.024 |
## resto | 0.155 0.007 0.030 |
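A possible answer to the question above about measuring association between categorical variables: a hedged sketch using pairwise Cramér's V (based on the chi-squared statistic from base R's `chisq.test()`). The helper function `cramers_v` is my own illustration, not part of the course material; values near 0 suggest little association, values near 1 a strong one.

```r
# Sketch (assumption): Cramér's V as an association measure between two
# factors, to help pick weakly associated variables for the MCA.
cramers_v <- function(x, y) {
  tab <- table(x, y)                       # contingency table
  chi2 <- suppressWarnings(
    chisq.test(tab, correct = FALSE)$statistic
  )
  n <- sum(tab)                            # total count
  k <- min(nrow(tab), ncol(tab))           # smaller table dimension
  unname(sqrt(chi2 / (n * (k - 1))))       # V in [0, 1]
}

# Example usage on two of the tea variables:
# cramers_v(tea_new_2$evening, tea_new_2$dinner)
```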
# Visualize the MCA: plot the categories of all variables at once to see which ones cluster together as the same behavior, e.g. unpackaged and tea shop likely belong to the same type of consumer.
plot(mca, invisible=c("ind"), graph.type = "classic")
# Let's now choose only a few variables. Question: how can we analyze the association between categorical variables, in order to select the ones to study with the MCA?
tea_new_2 %>%
pivot_longer(cols = everything()) %>%
ggplot(aes(x = value)) +
facet_wrap(~ name, scales = "free") +
geom_bar() +
labs(title = "Distribution of Categorical Variables")
# Select a smaller dataset for the MCA analysis
str(tea_new)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "+60","15-24",..: 4 5 5 2 5 2 4 4 4 4 ...
## $ frequency : Factor w/ 4 levels "+2/day","1 to 2/week",..: 3 3 1 3 1 3 4 2 1 1 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
tea_smaller <- tea_new %>% dplyr::select(exciting,relaxing,effect.on.health,sex,where,sugar)
# model
mca2 <- MCA(tea_smaller, graph = FALSE)
# summary of the model
summary(mca2)
##
## Call:
## MCA(X = tea_smaller, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 0.226 0.211 0.183 0.160 0.156 0.123 0.108
## % of var. 19.413 18.045 15.680 13.683 13.381 10.541 9.257
## Cumulative % of var. 19.413 37.458 53.138 66.822 80.202 90.743 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | 0.385 0.219 0.158 | -0.267 0.113 0.076 | -0.087
## 2 | 0.091 0.012 0.009 | 0.587 0.546 0.363 | -0.549
## 3 | -0.667 0.655 0.722 | -0.123 0.024 0.024 | -0.007
## 4 | 0.093 0.013 0.011 | -0.803 1.021 0.840 | 0.058
## 5 | -0.223 0.073 0.067 | -0.384 0.234 0.198 | 0.212
## 6 | -0.223 0.073 0.067 | -0.384 0.234 0.198 | 0.212
## 7 | -0.223 0.073 0.067 | -0.384 0.234 0.198 | 0.212
## 8 | -0.667 0.655 0.722 | -0.123 0.024 0.024 | -0.007
## 9 | -0.362 0.193 0.117 | -0.051 0.004 0.002 | 0.264
## 10 | -0.362 0.193 0.117 | -0.051 0.004 0.002 | 0.264
## ctr cos2
## 1 0.014 0.008 |
## 2 0.550 0.317 |
## 3 0.000 0.000 |
## 4 0.006 0.004 |
## 5 0.082 0.060 |
## 6 0.082 0.060 |
## 7 0.082 0.060 |
## 8 0.000 0.000 |
## 9 0.127 0.062 |
## 10 0.127 0.062 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr
## exciting | 0.816 18.946 0.420 11.203 | 0.294 2.655
## No.exciting | -0.514 11.944 0.420 -11.203 | -0.186 1.674
## No.relaxing | 0.521 7.521 0.164 7.002 | 0.919 25.182
## relaxing | -0.315 4.545 0.164 -7.002 | -0.555 15.217
## effect on health | 0.518 4.342 0.076 4.756 | 0.504 4.430
## No.effect on health | -0.146 1.225 0.076 -4.756 | -0.142 1.249
## F | -0.516 11.626 0.389 -10.778 | 0.293 4.026
## M | 0.753 16.963 0.389 10.778 | -0.427 5.875
## chain store | 0.022 0.022 0.001 0.499 | -0.304 4.693
## chain store+tea shop | -0.377 2.717 0.050 -3.862 | 0.613 7.731
## cos2 v.test Dim.3 ctr cos2 v.test
## exciting 0.055 4.043 | -0.627 13.851 0.248 -8.609 |
## No.exciting 0.055 -4.043 | 0.395 8.732 0.248 8.609 |
## No.relaxing 0.510 12.352 | -0.231 1.831 0.032 -3.105 |
## relaxing 0.510 -12.352 | 0.140 1.106 0.032 3.105 |
## effect on health 0.072 4.631 | 0.866 15.046 0.212 7.956 |
## No.effect on health 0.072 -4.631 | -0.244 4.244 0.212 -7.956 |
## F 0.125 6.115 | -0.228 2.821 0.076 -4.772 |
## M 0.125 -6.115 | 0.333 4.116 0.076 4.772 |
## chain store 0.165 -7.017 | -0.271 4.276 0.130 -6.243 |
## chain store+tea shop 0.132 6.282 | -0.139 0.456 0.007 -1.422 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## exciting | 0.420 0.055 0.248 |
## relaxing | 0.164 0.510 0.032 |
## effect.on.health | 0.076 0.072 0.212 |
## sex | 0.389 0.125 0.076 |
## where | 0.108 0.170 0.490 |
## sugar | 0.203 0.332 0.039 |
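To see which individual categories drive each dimension (rather than the per-variable eta² above), the FactoMineR result stores category contributions in `mca2$var$contrib`. A short sketch, assuming the `mca2` object from the chunk above:

```r
# Sketch: list the categories contributing most to dimension 1.
# mca2$var$contrib has one row per category, one column per dimension.
contrib_dim1 <- sort(mca2$var$contrib[, 1], decreasing = TRUE)
head(round(contrib_dim1, 2))
```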
# visualize MCA
plot(mca2, invisible=c("ind"), graph.type = "classic")
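To judge how many dimensions are worth interpreting, a scree plot of the eigenvalues can complement the summary tables above. A minimal base-R sketch, assuming the `mca2` object already exists (`mca2$eig` is part of every FactoMineR MCA result):

```r
# Sketch: scree plot of the percentage of variance per MCA dimension.
barplot(mca2$eig[, "percentage of variance"],
        names.arg = seq_len(nrow(mca2$eig)),
        xlab = "Dimension", ylab = "% of variance",
        main = "MCA scree plot")
```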